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Professor  Glenn  W.  Graves,  Chairman 


The  topics  of  maximum  likelihood  estimation  and 
nonlinear  programming  are  developed  thoroughly  with  emphasis 
on  the  numerical  details  of  obtaining  estimates  from  highly 
nonlinear  models. 

Parametric  estimation  is  discussed  with  the  three 
parameter  Weibull  family  of  densities  serving  as  an  example. 
A general  nonlinear  programming  method  is  discussed  for  both 
first  and  second  order  representations  of  the  maximum 
likelihood  estimation,  as  well  as  a hybrid  of  both 
approaches.  A new  class  of  constrained  parametric 
estimators  is  introduced  with  numerical  methods  for  their 
determination. 


Structural  estimation  with  maximum  likelihood  is 
examined,  and  a Bernoulli  regression  technique  is  presented. 
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CHAPTER  I 

STATISTICAL  ESTIMATION 

JU  INTRODUCTION  to  the  dissertation 


This  dissertation  is  concerned  with  a class  of  problems 
of  basic  importance  in  applied  statistics  - the  estimation 
of  parameters  in  a complicated  model  where  simple  closed 
form  estimators  do  not  exist  and  it  is  necessary  to  resort 
to  numerical  methods.  Many  existing  numerical  approaches 
prove  to  be  of  little  practical  value  in  the  context  of 
these  actual  cases  because  of  convergence  problems.  The 
main  purpose  is  to  develop  new  numerical  techniques  by 
combining  recent  developments  in  the  theory  and  practice  of 
optimization  with  statistical  theory  and  to  demonstrate  the 
efficacy  of  these  methods  by  application  to  the  special 
class  of  complicated,  highly  nonlinear  problems  arising  in 
statistical  estimation.  The  applications  are  addressed 
primarily  to  maximum  likelihood  estimation,  and  the  new 
methods  are  compared  where  possible  to  previous  results. 
The  general  numerical  technique  developed  is  also  used  to 
solve  a new  class  of  estimation  problems  with  nonlinear 
constraints  on  the  parameters.  The  numerical  approach  is 
further  utilized  to  provide  an  alternative  to  least  sguares 
regression,  especially  for  problems  with  discrete  dependent 
variables . 

The  present  chapter  reviews  the  mathematical  foundation 
for  statistical  estimation  for  noth  density  functions  and 
structural  models,  and  provides  justification  for  use  of 
maximum  likelihood  estimation.  Chapter  II  presents  a 
history  cf  nonlinear  programming  with  both  search  and  ascent 
methods,  with  emphasis  on  numerical  performance  for  highly 
nonlinear  objective  functions.  Cnapter  III  introduces  the 
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maximum  likelihood  estimation  problem  for  the  parametric 
Weibull  family  of  density  functions.  The  new  techniques  of 
the  dissertation  are  developed  and  demonstrated.  A new 
class  of  constrained  maximum  likelihood  estimators  is 
proposed  with  sample  problems.  Chapter  IV  addresses  a class 
of  regression  models  in  which  the  dependent  variable  is  a 
Bernoulli  observation,  develops  a statistical  theory  for 
solutions  of  the  model  and  gives  a numerical  example. 
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b_.  INTRODUCTION  TO  STATISTICAL  ESTIMATION  THEORY 

A classical  area  of  intense  interest  in  statistics  is 
the  art  of  using  sampling  information  to  make  valid 
inferences  about  unknown  parameters  in  the  distribution  of  a 
population  under  study;  this  body  of  technique,  motivated  by 
the  mathematical  theory  of  statistical  estimation  put  forth 
by  Fisher[86],  can  be  applied  in  several  ways  to  any  given 
sample  producing  various  estimates  of  the  parameters,  and 
leaving  us  with  the  problem  of  selecting  a "good"  estimate 
from  among  the  possibly  infinite  number  of  competitors. 

An  investigator  is  apt  to  feel  that  a "good"  estimate 
is  obviously  that  which  is  closest  to  the  true  parameters. 
However  since  the  estimator  is  a mathematical  function  of 
the  sample  ( a statistic  ) it  is  itself  random  from  sample 
to  sample,  so  that  the  attractiveness  of  a particular 
randomly  distributed  estimator  will  depend  upon  the  long  run 
char,  .teristics  described  by  its  sampling  distribution.  For 
instance,  if  the  sampling  distribution  of  an  estimator  for  a 
parameter  vector  has  a great  deal  of  its  probability 
concentrated  in  a very  small  neighborhood  of  the  true 
parameter,  and  a competing  statistic  does  not,  we  would 
probably  find  the  former  estimator  to  be  "better"  than  the 
latter  for  purposes  of  valid  inference.  That  is,  the 
probability  cf  an  estimate  being  close  to  the  true  parameter 
is  higher  in  the  former  case,  so  we  use  that  particular 
method  with  cur  sampling  information.  Unfortunately,  there 
is  seldom  a guarantee  that  a statistic  will  be  "good"  for 
every  sample,  or  even  that  it  will  produce  useful  or 
intuitively  acceptable  estimates.  Therefore,  one  must 
choose  an  estimator  on  the  basis  of  its  long  run  properties 
relative  to  those  of  feasible  alternative  estimators  and  in 
the  context  of  each  application. 
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In  order  to  formalize  some  of  those  concepts  of 

th 

"goodness,"  let  us  define  the  j observation  of  an 


* i 


m-dimensional  vector,  X^,  as 


X — { x , •••#  x } , j = 1#2,  • • • » n i 
j jl  I® 


with  X row  j of  X,  the  observation  data  matrix. 

j 


It  should  be  made  clear  at  the  outset  that  if  the 
successive  observations  in  X are  not  random,  then  we  must 
know  the  precise  nature  of  the  sampling  procedure  which 
leads  to  this  non-randomness  for  the  observations,  or  very 
little  inference  is  possible.  For  this  reason,  X is  assumed 
here  to  result  from  random  sampling  from  a population  with  a 
single  set  of  parameters,  T. 

For  purposes  of  parametric  estimation,  we  must  know,  or 
have  assumed  hypothetically,  the  precise  mathematical  form 
of  the  distribution  of  each  observation  of  the  parent 
population.  Therefore,  let 


VY1’  • 


represent  tnis  density,  with 


T = (t  , 


• • r t ) t 

k 


a set  of  /.  columns  of  unknown  parameters  to  be  estimated  and 
f non-negative  over  the  region  of  admissible  ranges  of  X 

j j 

and  T. 


— im 


Point  estimation,  then,  is  the  interpretation  of  a 
statistic,  computed  from  X as  a vector  of  constants  which 
can  be  assumed  as  the  inferred  value  of  T;  interval 
estimation  is  the  specification  of  an  interval  such  that  a 
known  proportion  or  such  intervals  contain  the  parameter  T. 


Fol  simplicity  of  exposition,  let  us  assume  that  = f 

for  all  j,  and  momentarily  that  k = 1.  Then,  let  t (n)  be  a 

statistic  to  be  used  as  an  estimator  of  t based  on  a random 

sample  of  size  n.  It  is  reasonable  to  assume  that  the  cost 

of  obtaining  the  sample  is  some  monotonic  increasing 

function  of  n,  and  thus  that  the  economic  justification  of 
/\ 

t (n)  depends  upon  how  •'good'1  it  is  as  a function  of  n.  In 
this  context  some  of  the  following  measures  cf  desirability 
of  estimators  are  proposed  as  functions  of  sample  size,  and 
thus  cost. 


1.  Existence 

It  is  always  necessary  to  be  able  to  demonstrate  that  a 
particular  statistic  exists  with  its  attendant 
properties  for  a given  sample  space,  probability 
distribution,  and  so  forth. 


2.  Simple  Consistency 

A statistic  is  simply  consistent  if  for  any  arbitrarily 
small  positive  constants  c and  d there  is  a sample  size 
N such  that 

Pr[|t(n)  - t|  <*  c]  > 1 - d , r>h  . 
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3.  Squared  Error  Consistency 


A statistic  is  sair  to  have  squared  error  consistency 
if  for  any  arbitrarily  saall  constants  c and  d and  some 
positive  integer  n, 


A 2 

Pr[  (tfa)  - t)  < c]  > 1 - d , n>N 


Some  probabilists  view  these  consistency  properties  as 
special  cases  of  stochastic  convergence  under 
particular  norms.  Both  types  of  consistency  are 
desirable  in  the  sense  of  early  discussion  in  this 
chapter,  producinq  with  high  probability  values  of  t(n) 
in  a small  neighborhood  of  t,  but  consistency  is 
achieved  at  possibly  high  cost. 


4.  Bias 


The  bias,  b(n),  of  the  statistic  t(n)  is  defined 


b(n)  = E[t(n)  - t]  , 


with  E the  expectation  operator.  If  b (n)  =0  for  all 


E[t(n)  ] = t 

A 

and  t (n)  is  said  to  be  unbiased. 

A 

If  b{n)  approaches  zero  as  n increases,  then  t(n)  is 
said  to  be  asymptotically  unbiased. 

Unbiasedness  is  an  intuitively  desirable  point 
property,  but  should  not  be  confused  with  neighborhood 
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properties  such  as  consistency;  neither  property 


inplies  the  other.  Further,  b (n)  can  sometimes  be 

A 

determined,  or  estimated,  and  removed  from  t(n). 


5.  Variance 


Tae  variance  of  a statistic  t (n)  is  defined 


^ XV  2 2 XV  XV  2 

V[t(n)  ] = E[  t(n)  ] - E [ t (n)  ] = E[  (t  (n)  - t - b ( n)  ) ] 


This  may,  or  may  not,  be  analytically  available 

A 

depending  upon  the  mathematical  form  of  t(n)  and  f,  but 


it  is  a characteristic  of  the  sampling  distribution  of 
A A 

t (n)  and  thus  describes  long  range  behavior  of  t(n). 


6.  Mean  Squared  Error 


The  mean  squared  error  of  t(n)  is  defined  as 


A2A  2 

M.S.E.  = E[  (t  (n)  - t)  ] = V[t(n)  3 + b(n) 


He  see  that  the  M.S.E.  and  variance  are  identical  for 
unbiased  statistics,  and  that  for  biased  statistics, 
the  M.S.E.  exceeds  the  variance. 


7.  Likelihood 


For  independent  observations  the  likelihood  of  t (n)  is 
defined  by  Fisher[  86  3 as 


L {X, t)  = f (X  ,t)  ••  *f  (X  ,t)  , 

1 n 
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and  is  regarded  as  proportional  to  the  probability  of 
the  occurrence  of  the  vector,  X,  given  parameter  t. 


8.  Sufficiency 
A 

A statistic,  t(n),  is  said  to  be  sufficient  if  it  can 

be  shown  that  the  conditional  probability  distribution, 

nj  a 

h,  of  any  other  statistic,  t (n) , given  t (n)  does  not 
depend  upon  the  parameter  t : 

(W  A 

h(  t(n)  | t (n)  ) not  a function  of  t . 

A 

Sufficiency  for  t(n)  ijrlies  that  all  the  sample 

A 

information  concerning  t has  been  exhausted  by  t(n). 

Such  statistics  exist  for  a very  important  family  of 
density  functions  including  the  exponential,  binomial, 
chi-square,  gamma,  and  normal  distributions.  As  we 
shall  see,  a straightforward  algorithm  may  be  used  to 
identify  sufficient  statistics 


9.  Completeness 


Let  s (X  ) be  a continuous  function  of  X . If 

j 1 

E[s(X  ) ]=0  for  every  admissible  t implies  that  s (X  ) =0 

j j 

for  all  X then  f (X  ,t)  is  a complete  family  of  density 

j 

functions. 
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10.  Minimum  Ifean  Squared  Error 

It  has  teen  shown  by  Rao(  196  ] and  Cramer[56]  that  under 
assumptions  of  regularity  the  lower  bound  for  M.S.E. 
of  any  statistic  is 

2 2 2 

M.S.E.  = -(1  ♦ db/dt)  /E[t)  ln(L)/iU  ] . 

The  regularity  assumption  disallows  discontinuities  in 
f that  depend  upon  t.  This  bound  may  or  may  not  be 
achievable. 

For  an  unbiased  statistic,  this  lower  bound  for 
variance  is 

N2  ,2 

M.  V.  = - 1/E[  o In  (L)  /dt  ] . 


11.  Squared  Error  Efficiency 

A . 

A statistic,  t(n)  , is  relatively  efficient  if  its 

fit 

M.S.E.  is  less  than  that  of  a competitor,  t (n) , for  a 
given  sample  size: 

A 2 tv  2 

E[  (t  (n)  - t)  ] < E(  (t  (n)  - t)  ] . 

We  can  also  treat  this  as  an  asymptotic  property  of  an 

estimator.  If  the  inequality  ultimately  holds  for  any 

A 

competitor  we  simply  say  that  t(n)  is  asymptotically 
eff icient. 

This  is  a very  appealing  relative  measure  of  the 
"goodness"  of  a statistic.  It  seems  reasonable  to 
assume  that  the  cost  associated  with  an  error  in 
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estimation  is  an  increasing  nonlinear  function  of  the 
size  of  the  error.  For  example,  the  effect  of  a small 
error  might  well  be  unimportant.  A large  error,  on  the 
other  hand,  might  lead  to  significant  costs  due  to 
incorrect  decisions  based  on  the  estimate.  The  precise 
cost-error  relationship  would  be  most  difficult  to 
specify  mathematically.  Assuming  that  the  cost  is  a 
quadratic  function  of  estimation  error  gives  a cost 
function  that  is  tractable  mathematically,  and  weights 
larger  estimation  errors  more  heavily  than  small 
errors.  Thus,  with  this  assumption,  a choice  of 
estimators  on  the  basis  of  relative  efficiency  becomes 
a choice  of  minimum  expected  cost. 


12.  Uniqueness 

For  purposes  of  inference,  it  is  desirable  but  often 
impossible  to  demonstrate  that  the  statistic  used 
uniquely  satisfies  its  own  definition. 

11.  Asymptotic  Normality 

An  estimator  is  asymptotically  normal  if  its  sampling 

distribution  approaches  normality  with  increasing 

sample  size.  This  property  gives  a statistical 

foundation  for  making  the  probability  assertions 

required  for  interval  estimation;  it  obviates  the  need, 

case-by-case,  to  treat  a statistic  as  a mathematical 

transformation  applied  to  the  random  variables  in  each 

sample  and  attempt  to  use  statistical  transformation 

A 

methods  to  derive  a sampling  distribution  for  t(I,n)  in 
closed  form.  In  fact,  such  an  analytic  derivation  is 
frequently  mathematically  impossible. 
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To  use  the  property  of  asymptotic  normality  foi 
rnterval  estimation,  we  require  knowledge  of  the  first 
^ two  moments  of  the  estimator  so  that  the  parameters  of 

the  normal  distribution  may  be  obtained^  4 ].  In  some 
instances  these  cannot  be  obtained  analytically,  as  is 
shown  by  Mann,  Schafer,  and  Singpur wa lla[ 16 1 , p . 263  ]. 


14.  Best  Asymptotic  Normality 

A 

A statistic,  t (n) , is  Best  Asymptotically  Normal, 

A 

B.A.N.,  if  it  is  simply  consistent  and  t (n)  - t 
approaches  a normal  distribution  with  zero  mean  and  a 
variance  less  than  that  of  any  competitor  with 
asymptotic  normality  over  the  same  open  interval  for  t. 
(In  his  introduction  of  B.A.N.  estimators,  Neyman(172] 
gives  a more  general  set  or  existence  conditions  in  the 
context  of  continuous  data  grouped  into  classes.)  Note 
that  B.A.N.  estimators  are  not  necessarily  efficient, 
or  unique,  but  that  they  are  asymptotically  unbiased, 
and  of  course  offer  the  advantages  of  asymptotic 
normality  previously  discussed. 


Finally,  with  suitable  notation  adjustments,  all  these 
characteristics  of  point  estimators  generalize  to  the 
multidimensional  estimation  case,  k > 1 . For  instance,  the 
variance  should  be  notationally  replaced  with  a 
variance-covariance  matrix,  f. 

A A 

¥ = E[  (T  ( n)  -I)  (T(n)-T)  * ] 
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Ci  CHOICE  OF  ESTIMATOR. . . DENSITY  FUNCTIONS 

The  art  in  statistical  estimation  is  as  much  the  choice 
of  an  estimator  as  its  mathematical  derivation  for  a given 
problem.  Although  a myriad  of  estimation  techniques  have 
been  proposed  in  the  literature,  only  those  generally 
applicable  to  the  problems  to  be  considered  here  are 
introduced.  Noted  by  their  absence  are  Bayes  estimators, 
formulated  from  his  idea[15]  of  using  prior  information,  but 
which  do  not  apply  to  a constant  vector,  T,  and  exist  only 
for  very  restricted  choices  of  prior  multivariate  density 
for  T,  and  Minimum  Chi-Square  estimators,  K.C.S.,  discussed 
at  lenqth  by  Rao[196],  which  apply  to  continuous  data 
grouped  into  classes,  and  are  very  similar  in  both 

determination  and  asymptotic  properties  to  the  maximum 
likelihood  estimators,  which  are  presented  shortly. 


Moment  Estimators,  T(n),  proposed  by  Pearson[  1 33  ],  are 

formed  by  equating  the  sample  moments  of  X with  its 
theoretical  moments  stated  in  terms  of  the  parameters,  T. 

The  solution  for  T (n)  may  not  be  possible  in  closed  form  for 

• 

many  density  functions,  and  T(n)  is  not  necessarily  unique 
for  any  given  sample,  however  Pearson  introduced  a large 

fann-ly  of  special  distributions  which  yield  solutions  for 

• 

T(n).  Moment  estimators  are  usually  consistent  in  both  the 
simple  and  squared  error  sense,  asymptotically  normal,  but 
not  b.A.N.,  and  can  be  efficient  only  when  the  variance  of 
X dominates  higher  moments  of  f,  as  is  true  with  the  normal 

• 

distribution.  In  general,  T(n)  has  few  advantages  over 
common  competitors  for  any  particular  density  function,  t. 
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anu  the  disturbing  habit  or  frequently  producing 
outrageously  bad  estimates,  even  inadmissible  ones.  The  use 
of  moment  estimators  by  Pearson  and  others  has  been  largely 
restricted  to  the  more  specialized  problem  of  choosing  both 
a mat hematical  form  for  f when  none  is  known,  and  estimation 
of  resulting  parameters. 

v 

Sufficient  statistics,  T(n),  have  been  demonstrated  by 
FirK^i[3o]  and  Neyman[174]  to  exist  for  any  density  for 
which  the  likelihood  function  may  be  partitioned  into 

v 

L(X,T)  = H (T  (n)  , T)  K (X) 

v 

with  H an  exclusive  nontrivial  function  of  T(n)  and  T,  and  K 

free  of  terms  or  constraints  involving  T.  A condition 

v 

implying  existence  of  a sufficient  statistic,  T(n),  is  that 
f belong  to  the  Koopman-Pit man  exponential  iaraily[  14  2, 18t>  ] 
such  that  f may  be  stated 

f(X.T)  = expt'p  (T)m(X  ) ♦ s(X.)  ♦ g (T)  ] 
j j D 

with  p (T)  a nontrivial  continuous  function  of  T,  s (X  ) and 

j 

m(X  ) continuous  functions  of  X , dm/dX  * 0,  and  the  range 

j j j 

of  X independent  of  T. 

j 


Sufficient  statistics  are  of  strong  intuitive  appeal 
since  they  demonstrably  use  all  of  the  sample  information 
available.  The  algorithm  for  finding  a sufficient  statistic 
is  straightforward,  leading  immediately  either  to  the 
establishment  of  such  a statistic,  or  a proof  that  no 
sufficient  statistic  exists  [128,  p.  231,  141,  p.  26]. 
Unfortunately,  sufficient  statistics  are  not  necessarily 
consistent,  unbiased,  or  efficient. 
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Any  nontrivial  one-to-one  transformation  of  T (n)  is 
also  sufficient  for  T.  Therefore,  whenever  possible  we 
choose  an  estimator  from  this  infinite  family  of  sufficient 
statistics  in  order  to  achieve  one  or  more  additional 
desirable  properties  such  as  consistency,  minimum  variance, 
or  most  often  unbiasedness. 

A Minimum  Variance  Unbiased  Estimator,  N.V.U.E., 
discussed  by  Rao[195]  and  Blackwell[  24  ],  is  always  a 
function  of  the  sufficient  statistic,  and  is  found  as  the 

conditional  expectation  of  any  statistic  which  is  unbiased 

v 

for  T,  given  the  sufficient  statistic,  T(n).  The  M.V.U.E., 
wnen  it  can  be  derived  via  the  conditional  density  required, 
is  necessarily  simply  and  M.S.E.  consistent,  and  the  most 
efficient  unbiased  statistic  for  any  sample  size.  Further, 
if  the  density  function  is  complete,  the  M.V.U.E.  is 
unigue[  128, p.229].  The  mathematical  details  of  deriving  the 
M.V.U.E.  are  arduous,  but  the  statistic  is  desirable 

especially  fcr  small  samples  where  bias  and/or  M.S.E.  are 
high  for  most  competitors.  A minimum  M.S.E.  statistic 

provides  a tradeoff  by  minimizing  the  sum  of  variance  and 
squared  bias,  and  can  be  preferable  to  the  M.V.U.E.  when 
unbiassedness  is  not  absolutely  essential.  Unfortunately, 
minimum  M.S.E.  statistics  are  only  rarely  derivable  for 
finite  sample  sizes,  and  when  found  often  correspond  to  the 
M.V.U.E.  result.  For  instance,  the  sample  mean  from  a 
normal  distribution  can  be  shown  to  be  both  a M.V.U.E.  and 
minimum  M.S.E.  statistic. 

Maximum  Likelihood  Estimators,  M.L.E.,  suggested  by 

Fisher[86],  are  found  by  maximizing  the  likelihood  function 

L(X,T)  by  choice  of  T.  These  intuitively  appealing 
A 

estimators,  T(n),  can  often  be  derived  in  closed  form  by 

differential  calculus,  and  always  exist  under  mild 

A 

regularity  conditions.  Although  T(n)  is  frequently  biased 
for  small  samples,  it  is  asymptotically  unbiased,  B.A.N., 
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and  simply  and  squared  error  consistent  as  shown  by 
Wald[  225, 224  ].  It  is  also  asymptotically  efficient, 
ultimately  achieves  the  Minimum  variance  bound,  and  can  be 
shown  to  be  a function  of  the  sufficient  statistic,  if  one 
exists.  Even  for  relatively  small  samples,  the  M.L.E.  can 
be  more  efficient  than  the  M.V.U.E.,  as  has  been  shown  by 
Brown  and  Rutemiller[  31  ]. 

M.L.E.  also  have  an  important  invariance  property.  For 
any  non-trivial  function  of  T,  u(T),  with  a single- valued 
inverse, 

A A 
u (T)  = u (T  (n)  ) . 

For  example,  invariance  permits  transformations  to  reduce 
bias  without  sacrifice  of  other  desirable  M.L.E.  properties. 

This  property  is  an  indispensable  tool  in  mathematical 
modelling.  Since  parametric  estimation  is  usually  performed 
only  as  a preliminary  part  of  a larger  investigation, 
invariance  is  crucially  important,  permitting  M.L.  point 
estimates  to  be  unconditionally  introduced  into  any 
admissible  function  of  the  associated  parameters,  with  the 
function  directly  inheriting  all  the  desirable  M.L. 
properties.  This  permits  analysis  of  complex  hierarcnical 
systems  to  be  conducted  in  a straightforward  manner. 
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rmality  for  all  M.L.E.  makes  them  very 
erval  estimation,  especially  in  the 
e.  Unfortunately,  M.L.E.  can  not,  in 
teed  to  be  unique,  although  uniqueness  can 
a case-by-case  basis.  Although  numerical 
M.L.E.  can  at  worst  be  exceedingly 
ctice,  the  "good"  properties  of  these 
em  so  singularly  attractive  in  the  general 
ical  estimation  as  to  motivate  the 
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Dj.  CHOICE  OF  ESTIMATOR. . .STRUCTURAL  MODELS 


Suppose  we  examine  a model  in  which  the  population  mean 
is  not  strictly  a function  of  T,  but  rather  a particular 
mathematical  function  of  the  population  parameters,  T,  and 
some  observed  constants,  X.  If  we  define  our  sampling 
process  to  be  the  measurement,  with  some  random  errcr,  of 
observations,  Y,  from  populations  whose  parameters  depend  on 
| and  T,  then  a problem  which  results  is  the  estimation  of 
the  parameters,  T,  based  on  tne  sample 

{Y  , I)  , 


by  use  of  the  relationship 

I = Y (T,  X,  e)  , 

and  known  information  about  the  nature  of  the  error,  e. 
This  technique  is  known  as  regression. 

One  example  of  such  a model  is  classical  linear 
regression,  where 


Y = XT  + e . 
ru 

Since  Y-XT  (n)  is  the  sample  estimation  error  in  the  model 

n/ 

for  the  estimator  T(n)  , the  usual  approach  to  this 
estimation  is  to  assert  a quadratic  cost  function  and 
minimize  the  scalar  sum  of  squared  deviations 

(Y-XT) » (Y-XT) 

by  choice  of  T.  This  technique  was  first  suggested  for  use 
in  interpolation  of  planetary  data  oy  Legendre[ 1 50  ]. 
Provided  that  X'|  is  non-singular,  which  requires  n > k, 
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this  quadratic  objective  function  has  a unique  solution. 


ro  -1 

T(n)  = (X'X)  X • Y . 


This  Least  Squares,  L.S.,  estimator  is  attractive  to  use  for 
linear  models.  The  L.S.  solution  is  the  best  linear 
unbiased  estimator,  B.L.U.E.,  in  the  sense  that  among  all 
unbiased  linear  combinations  of  X this  estimator  has  minimum 
variance  regardless  of  the  distribution  of  e.  Gauss[95]  has 
shown  that  when  e is  normal  the  L.S.  solution  always 
maximizes  the  joint  density 


f (y  IX  ,T)  •••  f (y  |X  , T)  . 
11  n n 


This  remarkable  demonstration  both  anticipates  the  later 
discovery  of  M.L.E.  and  shows  that  in  the  normal  case,  the 
linear  model  has  a single  solution  which  is  both  L.S.  and 
a.L.E.  The  distributional  theory  for  interval  estimation  in 
the  linear  model  is  presented  by  Cochran[51],  and  is  based 
on  the  unique  class  properties  of  the  multivariate  normal 
density,  which  is  closed  for  affine  transformations, 
convolutions,  and  linear  mixtures  of  normals,  and  the  class 
of  chi-square  distributions  of  quadratic  normal  forms,  which 
is  closed  for  convolutions. 

The  assumption  of  normality  for  e and  linearity  are 
crucial  to  the  L.S.  approach,  since  for  non-normal,  or 
non-linear  models  the  distributional  results  fail.  In  fact, 
the  specification  of  a quadratic  cost  criterion  for  L.S. 
minimization  is  not  necessarily  justifiable  in  all 
applications;  for  instance,  mean  deviation,  or  minimax 
(Tchebychef f ) deviation  might  sometimes  be  more  reasonable. 


A general  M.L.E.  approach  tc  regression  focuses 
attention  on  the  density  of  e to  specify  the  likelihood 
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which  is  maximized  by  choice  of  T.  It  is  not  necessary  to 
derive  L(I|X,T)  from  Y (T,  X,  e)  if  one  can  state  the  density 
directly  as  in  the  case  of  Bernoulli  regression  examined  in 
Chapter  IV.  The  M.L.E.  solution  has  all  the  properties 
under  conditions  mentioned  previously,  regardless  of  the 
form  of  the  model,  although  those  that  are  asymptotic  are 
achieved  more  slowly  for  highly  non-linear  models  or 
extraordinary  distributions  for  e.  Sprott  and 

Kalbfleischf 2 17  ] have  examined  for  some  specific  models  the 
robustness  of  the  assumption  of  asymptotic  normality  made 
for  several  finite  sample  sizes. 


E.  SUMMARY:  JUSTIf ICATION  OF  M.L.E. 


As  we  have  seen,  the  M.L.E.  usually  have,  for  large 
samples,  all  the  desirable  properties  of  an  estimator.  They 
almost  always  exist  under  very  mild  regularity  conditions, 
asymptotically  they  are  consistent,  unbiased,  efficient, 
S.A.N.,  achieve  the  Cramer-Rao  minimum  variance  bound,  and 
they  are  sufficient  statistics  whenever  such  statistics 
exist.  They  can  often  be  derived  in  closed  form  by 
differential  calculus,  and  in  other  cases,  the  estimator  may 
be  solved  for  by  numerical  techniques. 

When  point  estimates  of  functions  of  parameters  are 
required  in  a mathematical  model,  it  is  pointless  to  choose 
estimators  for  their  "good11  properties  unless  the  function 
will  also  possess  those  properties.  In  practice,  the  M.L.E. 
are  the  only  available  estimators  with  so  many  desirable 
properties  that  are  all  invariant  under  such 
transformations.  As  mentioned  earlier,  this  invariance 
property  of  M.L.E.  is  vital  in  complicated  problems  where 
parametric  estimation  is  only  the  first  step  of 
investigation. 

3est  of  all,  M.L.E.  provide  a distributional  basis  for 
interval  estimation  which  does  not  depend  upon  simplifying 
assumptions  such  as  those  required  for  the  L.S.  approach. 
This  is  fortunate,  since  in  models  which  are  non-normal, 
non-linear,  or,  more  often,  both,  the  M.L.E.  provide  the 
only  reasonable  estimation  alternative.  Also,  for  the 
classic  linear  normal  model,  the  M.L.E.  provides  the  L.S. 
solution . 

For  small  samples,  the  M.L.E.  have  many  of  the  good 
estimator  properties,  and  are  often  the  best  statistic 
available.  Their  M.S.E.  is  frequently  the  best  among 
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competitors,  even  for  very  small  sample  sizes.  The  M.L.E. 
are  extremely  useful  in  small  sample  estimation  as  a 
starting  point  for  seeking  better  statistical  estimators  for 
particular  density  functions.  The  M.L.E.  are  always  derived 
by  exactly  the  same  method,  requiring  less  intuition,  skill, 
or  plain  luck  than  the  intricate  schemes  leading  sometimes 
to,  for  instance,  an  M.V.U.E.  In  some  statistics  texts,  in 
fact,  M.L.E.  are  the  only  estimators  introduced  since  they 
are  generally  easy  to  find  and  usually  produce  tetter 
estimates  than  other  methods[  156 , p.  1 62  ]. 


Among  alternative  estimators  for  any  given  problem,  the 
M.L.E.  nearly  always  provide  a very  good  property  set  that 
gets  better  very  quickly  with  increasing  sample  size,  and 
becomes  asymptotically  best.  For  those  cases  in  which  the 
M.L.E.  must  be  determined  numerically,  a potentially 
difficult  nonlinear  programming  problem  results. 
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CHAPTER  II 


SOflER|£AL  UCHHIfiOES  OF  ESTIMATION 


A.  INTRODUCTION  TO  NONLINEAR  NUMERICAL  ESTIMATION 


In  the  previous  chapter  we  have  proposed  a 
statistically  desirable  nonlinear  estimation  method,  M.L.E., 
which  leaves  us  with  the  problem 


MAX  { L(T)  } 
T 


The  form  of  L depends  upon  the  model  used.  Estimation 
of  density  function  parameters  tor  f gives 


L(X,T)  = f (X  ,T)  ***f  (X  ,T)  , 

1 n 


and  estimation  of  parameters  for  a structural  model  gives 


L(Y|X,T)  = f (Y|X  ,T)  •••f  (I|X  ,T)  . 

1 n 


In  either  case,  L is  known  to  be  a highly  nonlinear  function 
of  the  decision  variables,  T.  Since  X and  Y are  treated  as 
constants  in  these  two  models,  they  will  not  be  included  in 
the  further  notation  of  this  chapter,  so  that  both 
estimation  models  may  be  treated  at  once.  Thus 


L(T)  = f (T)  • • *f  (T)  . 

1 n 


Mathematical  constraints  may  be  present  for  the 
parameters.  These  may  be  simple  numerical  range 
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constraints,  upper  and  lower  bounds,  for  instance 


1 < t < u , i=1,  . . . , k , 

i i i 

or  more  complicated  joint  functions  of  T,  equality 
constraints  of  the  form 


91  (T)  = 0 , , 

or  inequality  constraints  such  as 

g (T)  < 0 . 

2 

The  set  of  both  types  of  constraints  is  referred  to 
collectively  as 


g.(T)  = {gJ(T),  g'(T)}  , 
g <T>  <0  . 

We  refer  to  the  conditioned  set  of  all  values  of  T which 

simultaneously  satisfy  the  constraint  set,  g (T) , as  the 

feasible  region  tor  T,  and  values  of  T within  that  region 

are  called  feasible  points.  A particular  constraint  that  is 

exactly  satisfied  by  T (a  row  of  g (T)  exactly  equal  to  zero) 

is  said  to  be  active.  If,  for  all  possible  pairs  of  two 

feasible  points,  T and  T , the  convex  combination 

1 2 


T = aT  + (1  - a)T 
1 2 


0 < a < 1 , 


is  also  feasible,  then  the  feasible  region  is  called  convex. 
For  M.L.E.  problems,  there  are  frequently  simple 
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numerical  bounds  placed  on  T.  These  are  usually  included  to 
insure  the  definition  of  a valid  density  function,  f. 
However,  general  mathematical  constraints  are  seldom 
present.  For  this  reason  we  will  initially  emphasize  the 
unconstrained  M.L.E.  problem  and  the  techniques  available 
for  its  solution. 

The  first  step  in  formulating  an  M.L.E.  problem  for 
solution  is  usually  the  replacement  of  the  likelihood 
function,  L,  by  its  logarithm,  ln[L].  It  is  easy  to  see 
that 

MAX  ( L(T)  } , and  HAX  { ln[L(T)]  } , 

T T 

are  both  achieved  by  the  same  value  of  T,  since  the 
logarithm  is  a aonotonic  increasing  function  of  its 
argument.  The  log-likelihood  function  becomes 

ln[L(T)  ] * ln[f  <T)  ]♦. . . + ln[  f <T)  ] , 

1 n 

This  reformulation  usually  gives  an  alias  for  L(T)  which  is 
a mathematically  simpler  function.  For  instance,  members  of 
the  Koopman-Pitman  family  of  density  functions  are 
remarkably  easier  to  deal  with  in  this  form.  This  is 
advantageous  for  both  analytic  and  numerical  work.  For 
instance,  since  L(T)  is  the  product  of  n sample  likelihoods, 
its  value  for  many  problems,  especially  for  large  n,  can 
numerically  violate  the  expressible  range  of  floating  point 
representation  on  a particular  digital  computer. 

He  henceforth  treat  L (T)  as  the  objective  function,  in 
either  the  likelihood,  or  aliased  log-likelihood  form. 
Further,  we  assume  where  necessary  that  L(T),  and  thus  f (T) , 
are  continuous,  twice  differentiable  functions  of  T at 
interior  points.  This  is  a very  weak  restricting  assumption 
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for  M.L.E.  models,  which  very  seldom  have  discrete 

parameters,  T,  and  rarely  have  non-dif f erentiable  density 
functions  (poles,  etc.)  for  realistic  problems  in  which  M.L. 
estimation  is  attempted.  It  is  not  necessary  in  a 
mathematical  programming  sense  to  emphasize  the  statistical 
relationship  of  the  M.L.E.  and  sample  size,  so  it  is  assumed 
notationally  that 

A A 

T = T (n)  . 


A stationary  point  of  L (T)  is  character izedf 2 1 ] by  the 

A 

necessary  condition  that  the  gradient  vanish  at  T, 

VL  (T)  = dL(T)/dT  I A = 0 . 

T=T 

Necessary  conditions  for  a local  maximum  are  that 

?L(T)  = 0 , 

and  that  the  symmetric  Hessian  matrix. 


H = (h  } = {A  (T)/<H  dt  } , 

i 3 

A 

be  negative  semidefinite  at  a stationary  point,  T;  that  is, 
for  any  vector  z not  identically  zero, 

A 

Z'H  (T)  z <0  . 


A vanishing  gradient  and  negative  definite  Hessian 

A A 

7L(T)*0  6 z'H(T)z  < 0 , 

provide  sufficient  conditions  for  a local  maximum  of  L. 
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If  the  Hessian  can  be  shown  to  be  negative  definite  for 

all  feasible  points  T,  then  L is  said  to  be  concave[  19  ],  and 

A 

a stationary  point,  T,  is  the  unique  global  maximum.  Other 
characterizations  of  stationary  points  of  L are  possible; 
these  other  cases  are  of  little  general  use  and  usually 
require  further  assuaptions  for  identification  of  maxima, 
such  as  higher-order  derivati ves[ 208 ]. 

Characterizations  of  extrema  of  L (T)  in  the  presence  of 
equality  constraints  requires  that  the  gradient  vanish  while 
all  the  equality  constraints  simultaneously  hold. 
Lagrangef 147  ] expressed  these  conditions  by  introducing  an 
r-dimensional  vector  of  arbitrary  multipliers,  , and 

augmenting  the  objective  function  of  the  problem  to  include 

the  constraints,  giving 

MAX  {L  (T)  - u'g  (T)}  , 

T,u  1 1 

which,  as  previously  shown,  is  stationary  if 

_ A A 

7 [ L(T)  - u*g  (T)  ]*0  e r < k , 

T,  u 11 

and  a local  maxima  under  conditions  for  the  Hessian  similar 
to  those  for  the  unconstrained  problem,  but  modified  by  the 
dimensionality  adjustment.  John[135],  and  later  Kuhn  and 
Tucker[  146 ],  have  generalized  the  necessary  conditions  to 
inequality  constraints  as  follows,  letting  be  a vector  of 

multipliers  associated  with  g^(T): 

V [ L (T)  - u'g^T)  ]=0  , 


with 
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complementary 


A A 

g (T)  £0,  u £0,  u'g  (T)  =0  . 

2 2 2 2 

The  last  condition  is  referred  to  as 
slackness. 

For  maximization  problems  subject  to  mixed  constraints, 
with  multipliers  defined 


U*  = {u*.  u^}  , 

necessary  conditions  for  a local  constrained  maximum  are: 

V [L(T)  - u*g  (T)  ] = 0 , 

T 

9^  (T)  *=0,  g^  (T)  <0#  u2<0f  u^g2<?)=0  . 

Local  sufficiency  for  these  conditions  further  requires  that 
the  constrained  objective  function  be  locally  concave,  that 
all  nonlinear  inequality  constraints  be  convex,  and  that  all 
equality  constraints  be  linear.  It  may  be  possible  to 
generalize  local  sufficient  conditions,  subject  tc  the 
Kuhn-Tucker  restrictions,  for  nonlinear  equality 
constraints . 

John[135]  actually  developed  conditions  requiring  that 
the  objective  function  also  have  a multiplier,  and  Kuhn  and 
Tucker[146]  qualified  admissible  constraint  sets  to  those 
without  singularities  on  the  boundary  such  as  an  outward 
pointing  cusp,  or  other  nonlinear  degeneracy;  in  these 
cases,  the  nultiplier  proposed  by  John  is  positive,  and  can 
in  fact  be  normalized  to  unity.  Their  development  defines 
the  Lagrangian  objective  function 
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A * 

and  specifies  that  if  a stationary  point  (T,u  ) is  also  a 


saddle  pcint,  that  is 


A * 

that  under  the  mild  assumptions,  the  point  (T,u  ) is  a 

solution  to  both  the  primal  and  dual  problems,  given 
respectively  at  the  left  and  right  above.  This  also 
suggests  that  methods  for  solution  of  the  primal  problem  can 
sometimes  profit  from  information  gained  by  simply 
examining,  cr  shifting  emphasis  completely  to  the  dual.  We 
might  interpret  the  primal  optimization  process  as 


maximization  subject  to  feasibility  with  respect  to 
constraints  and  the  dual  optimization  process  as 
minimization  of  infeasibility,  subject  to  a stationary 
primal  profit  criterion. 

Further  characterizations  under  varying  sets  of 
assumptions  and  useful  simplifying  qualifications  have  been 
given  by  Mangasarian  and  Fromovitz[  159 ],  Arrow  and 
Enthoven[6],  Arrow,  Hurowicz  and  Uzawa[7],  Kortanek  and 
Evans[143],  and  Wilde[  230,  231  ]. 

A 

For  many  likelihood  functions,  T may  be  determined  in 
closed  form  as  a stationary  point  of  L by  differential 
calculus.  In  such  cases,  demonstration  of  extremality  and 
uniqueness  proceed  directly  by  analytic  means  as  previously 
discussed. 

In  general,  however,  the  stationary  points  of  L must  be 
derived  iteratively  by  the  numerical  methods  of  nonlinear 
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programming.  The  general  fi.L.  estimation  problem  has  rather 
distinctive  features  in  this  respect.  The  number  of 
decision  variables,  or  parameters,  is  usually  very  small, 
seldom  more  that  three  for  density  functions  and  ten  for 
structural  models.  The  objective  function  and  especially 
its  gradient  are  highly  nonlinear,  expensive  to  evaluate 
numerically,  and  difficult  to  compute  precisely.  These 
problems  are  exacerbated  by  large  sample  sizes.  The 
constraints  are  usually  of  relatively  simple  form,  often 
just  numerical  bounds  on  T. 


29 


lljpVPinfpjWWM  PP .« !< 


B_.  METHODS  OF  MUMI^AL  OPTIMIZATION 


The  nonlinear  programming  methods  which  may  be  used  for 
M.L.  estimation  are  all  iterative  schemes  with  the 

following  features.  An  initial  value  of  T,  T , muFt  be 

0 


specified  or  guessed  by  the  investigator.  An  iteration 

mechanism  then  chooses  a step-size  and  direction  for 
determining  the  sequence 


T ,T 
0 1 


I 


T , 
m 


such  that 

L (T. ) > L(T.  ) , i=1,2,  ..  .,  m. 

l l- 1 

Finally,  a set  of  termination  states  is  specified. 
Termination  criteria  commonly  include  a maximum  value  of  m. 
A stalling  criterion  can  be  used  for  tolerance  of 
resolution,  with  d a vector  of  arbitrary  small  constants, 

|T  -T  | < d . 
m m-1 

A performance  criterion  can  be  employed  to  insure  acceptable 
distinguishafcility,  or  marginal  improvement. 


L (T  )-L<T  ) > minimal  gain, 

m m-1 


The  ideal  iteration  scheme  is  a totally  automatic 


l 

t 


I 
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algorithm  in  that  the  global  solution  is  reached  in  a finite 
number  of  steps  without  necessitating  human  intervention. 
Unfortunately,  no  single  method  realistically  qualifies  on 
this  basis,  especially  if  we  define  finiteness  in  terms  of 
exhausting  a reasonable  computer  budget.  Also,  a global 
solution  does  not  always  exist  in  the  strict  sense  for  all 
M.L.  problems.  In  practice,  even  the  attainment  of  a local 
maximum  can  be  delightful. 


A qood  iteration  algorithm  should  not  require  excessive 
computation  time  for  termination.  Neither  should  it  demand 
brilliant  intuition,  or  extraordinary  good  fortune,  on  the 
part  of  the  user.  Problem  specificity  of  good  iteration 
performance  is  also  undesirable,  unless  for  demonstrable 
cause  of  an  apparent  nature  general  enough  to  advise  prior 
choice  of  the  method. 


The  taxonomy  of  iteration  schemes  identifies  direct 
search  methods  as  those  which  achieve  gains  by  experiment 
with  evaluation  of  the  objective  function,  L (T)  . Ascent 
methods,  on  the  other  hand,  require  local  derivative 
information  to  calculate  a priori  where  each  following 
evaluation  of  the  objective  function  should  take  place. 
Ascent  methods  may  be  further  subclassified  as  either  direct 
ascent,  which  seek  immediate  gains  at  each  iteration,  or 
indirect  ascent,  which  seek  at  each  step  to  achieve  the 
necessary  conditions  for  a maximum.  Note  that  ascent 
methods  include  those  using  finite  difference  approximations 
to  derivatives.  Distinguishing  between  these  two 
classifications  is  at  times  most  difficult,  since  the 
systematic  experimental  achievement  of  increases  in  the 
objective  function,  L(T),  by  varying  the  argument,  T,  with  a 
direct  search  scheme  is  highly  suggestive  cf  cognizance  of 
differential  information  indicative  of  an  ascent  method. 
This  interminable  classification  problem  is  obviated  by  the 
plausible  defense  of  nomistic  innocence.  Several  classical 
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techniques  of  both  types  that  are  available  for  finding  t(n) 
when  k=1,  for  instance  golden  section  search,  regula  falsi, 
and  so  f orth[232, 1 93  ],  are  not  discussed  here. 
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C.  DIRECT  SEARCH  METHODS 


The  Hooke-Jeeves  pattern  search  method[129],  perhaps 
the  simplest  technique  known,  is  a maximization  scheme  based 
on  direct  evaluation  of  L (T)  . Given  a starting  point,  T^, 

and  stepsi?e,  the  iteration  sequence  proceeds  by  varying 

each  element  of  T by  one  step  in  each  direction  and 
evaluating  the  objective  function,  keeping  each  respective 
element  of  T at  the  value  which  lead  to  a maximum.  Thus,  T 

1 

will  be  at  a corner  of  the  k-dimensional  hyperrectangle 

defined  by  I *S  . The  scheme  proceeds  similarly  until  no 
0 0 

further  gain  seems  possible,  in  which  case  the  stepsize  is 

halved,  the  process  repeated  to  no  gain,  stepsize  halved 
again,  and  sc  forth  until  termination  is  recognized  within  a 
small  enough  neighborhood. 


Several  heuristic  modifications  have  been  proposed, 
including  a ridge-following  "memory”  tor  acceleration  of 
stepsize  when  an  element  of  s continues  step-to-step  to 
exhibit  no  change  in  sign  while  sequential  gains  are 
made[129],  a sequential  transformation  of  coordinates  in 
order  to  minimize  parameter  interaction  and  separate  the 
effects  of  steps  on  the  approximately  orthonormalized 
problem,  linear  minimizations  along  estimated  conjugate 
directions,  a restart  procedure  for  avoiding  local  minima 
and  stalling,  parallel  tangent  acceleration  suggested  by 
Shah,  Puehler,  and  Kempthornef 2 1 0 ],  quadratic  approximation 
with  an  interpolating  polynomial  over  the  local  search 
lattice,  and  introduction  of  random  numbers  to  avoid  dead 
ends  foi  the  search.  Such  ad  hoc  modifications  are  found  in 
Fletcher[ 87  ], 


Zangwill[  239  ],  Rosenbrock[  206  ],  Powell[190], 
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ana  Davies[66],  who  also  describes  response  surrace  airect 
optimization  schemes  encountered  in  experimental  design. 


The  simplex  method,  introduced  by  Spendley,  Hext,  and 
Himsworth[216  ],  generalized  by  Nelder  and  Mead[171],  and 
generally  referred  to  as  the  simplical  scheme  so  as  not  to 
confuse  it  with  the  linear  programming  algorithm,  uses  k+1 
points  defined  as  a simplex  in  the  k-dimensional  search 
space.  At  each  iteration  a new  point  is  created  to  replace 
the  point  associated  with  the  minimum  value  on  tne  simplex 
by  reflection  of  the  minimum  point  via  a ray  through  the 
centroid  of  the  other  points  over  a distance  determined  by  a 
reflection  constant.  A possible  dimensional  collapse  of  the 
simplex  is  avoided  by  special  logic,  and  acceleration  and 
convergence  are  achieved,  respectively,  by  expansion  of  tne 
maximum  point  on  the  simplex  on  a ray  from  its  centroid,  or 
contraction  of  the  minimum  point  on  the  simplex  on  a ray 
toward  the  centroid. 

This  ingenious  technique  works  much  like  the  pattern 

search  methods  examined  above,  and  will  almost  always 

terminate  eventually  by  converging  to  a local  maxima. 

Modifications  of  the  scheme  are  possible  with  random 

perturbations  to  mitigate  near  linear  dependencies  in  the 

simplex  and  to  avoid  final  convergence  to  a local  maximum. 

Numerical  bounds  can  be  accomodated  on  the  parameters. 

box[27]  found  the  simplical  scheme  superior  to  pattern 

search  and  Rosenbrock 's[  206  ] method,  and  introduced  the 

"complex"  search  method,  which  is  a generalization  of  the 

simplical  scheme  to  admit  a convex  inequality  constraint 

set.  Richardson  and  Kuester[199]  have  published  another 

constrained  simplical  program.  One  weakness  of  the  method 

is  the  requirement  for  an  interior  T , but  Noh[177],  has 

0 

further  generalized  the  complex  search  for  equality 


constraints  and  non-interior  starting  points.  Box  reported 
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that  for  his  simple  models,  objective  function  evaluations 
commonly  required  1000  times  as  long  as  the  complicated  stop 
selection  logic.  Parkinson  and  Hutchinson[ 181  ] discuss  tne 
relative  merits  of  variations  of  the  simplical  approach. 


Although  simplical  schemes  seem  to  work  in  practice, 
even  tor  difficult  problems,  no  acceptable  formal  proof  of 
convergence  has  yet  appeared.  The  theoretical  difficulty 
seems  to  lie  in  (unconstrained)  counter  examples  which  can 
be  constructed  and  for  which  the  method  should  not 
terminate.  For  instance,  see  the  cases  given  by  Shere[211] 
for  the  program  presented  by  Richardson  and  Kuester[ 1 99 ]. 
Realistically,  however,  confrontation  of  such  special  cases 
is  highly  unlikely.  On  the  other  hand,  it  is  true  that 
dimensional  collapse  is  a continuing  theoretical  and 
numerical  hazard  in  the  presence  of  constraints.  Finally, 
it  should  be  noted  that  these  are  scarcely  substantive 
criticisms  of  the  method  when  it  is  used  for  adaptive 
process  control,  as  it  was  originally  intended. 


Direct  search  methods  whicn  attempt  to  reliably  achieve 
global  maxima  have  been  proposed  by  Brooks[29],  Bocharov  and 
Fel'dbaum[  25  ] and  Page[1B0]-  These  treat  the  objective 
function  as  an  unknown  but  deterministic  response  to  the 
argument,  T.  The  optimization  proceeds  by  sequentially 
partitioning  mutually  exclusive  and  exhaustive  regions  for 
interior  T over  which  the  first  two  moments  of  the  objective 
function  are  estimated  to  discriminating  precision  by  random 
samplinq  or  numerical  quadrature  over  a k-di  mer.sional 
lattice,  and  a hypothesis  test  is  performed  to  select  the 
better  region,  which  is  in  turn  bisected  on  the  next  step. 
The  iteration  ceases  when  an  acceptably  small  region  is 
selected . 


It  is  important  to  note  the  difference  oetween  these 
area  evaluation  methods  and  simple  random  point  sampling. 


Without  the  partitioning  scheme  and  sequential  area 
estimation  and  hypothesis  tests,  tnese  methods  degenerate  to 
the  infamous  Las  Vegas  technique. 

Each  area  selection  method  suffers  from  a 
non-parametric  probability  of  excluding  the  region 
containing  the  global  optimum  at  some  intermediate  decision 
step  and  thus  of  unreliably  reporting  a surrogate,  nonlocal 
suboptimal  solution.  Geometric  features  such  as  an  isolated 
peak  with  steep  slopes  and  a shallow  base  can  evade 
detection  and  can  be  caused  by  a poor  choice  of  initial 
feasible  region  for  interior  points. 

Several  authors,  notably  Clough[50],  Cooper[55], 
Hartleyf 1 19 ],  Hartman[ 1 20 ],  Liau,  Hartley,  and  Sielkenf 154  ], 
and  2ak harov[ 238 ] have  developed  statistical  strategies  for 
region  sampling  and  evaluation  and  conducted  experiments 
with  standard  objective  functions.  They  report  limited 
success  in  actual  applications.  None  of  the  applications 
include  a problem  typical  of  M.L.E. 

High  frequency  oscillations  and  other  irregularities 
which  thwart  other  search  techniques  are  smoothed  and  thus 
mollified  by  this  area  approach.  This  smoothing 
characteristic  and  the  academically  appealing  global 
strategy  suggest  the  technique  for  finding  a reasonable 
startinq  domain  for  interior  points  for  some  other  search 
mechanism,  especially  if  the  latter  iteration  converges  only 
in  a close  neighborhood  of  a maximum,  or  if  the  objective 
function  is  pathological.  Some  experimentation  has  shown, 
however,  that  excessive  objective  function  evaluations  were 
necessitated  for  relatively  small,  uncomplicated  sample 
problems. 
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D . AbCENT  METHODS 


Most  indirect  schemes  are  c haract erized  by  an  iteration 
of  the  form 


-1 

T = T ♦ aM  s , i=1,2,  ...,  m , 
i i-1 


with  a positive  scalar  step  length,  a,  an  iteration  matrix 
-1 

M , and  a vector  of  directional  gradient  information,  s. 
For  instance,  the  first-order  method  of  steepest  ascent 
first  described  by  Cauchy[45],  and  later  by  Courant[56], 
Curry[59],  and  Levenberg[  153  ],  uses 

M - I , S = 7L  (T)  , 

and  chooses  the  stepsize  a as  a suitable  positive  constant 
to  increase  L (T)  along  the  ray 

T ♦ al7L(T)  . 
i-1 

a may  oe  chosen  to  produce  a maximum  along  the  ray  by  direct 
evaluation,  regula  falsi,  quadratic  approximation,  or  simply 
to  produce  any  gain.  This  method  ultimately  terminates  at  a 
local  maxima,  but  often  converges  with  slow  performance, 
especially  along  curved  rising  ridges  for  which  it 

hem-stitches  with  agonizing  progress. 

Further  discussion  of  ascent  methods  xs  given  by 
Goidstein[  1 0 1 ] and  Ramsay[194].  Powell[190]  and  Brent[28] 
give  first-order  ascent  schemes  using  difference 

approximations  for  derivatives,  with  due  attention  to  the 
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numerical  and  theoretical  consequences  of  such  substitution. 

A second-order  scheme,  the  Newton-Ra phson  method, 

applies 


M * -H  (T)  , s = VL  (T)  , a = 1 , 


for  which  convergence  termination  depends  upon  negative 
definiteness  of  H(T)  . This  condition  on  B(T)  is  usually 
guaranteed  only  over  a small  neighborhood  satisfying  the 
Lipschitz  condition  discussed  by  Henrici[  123  ],  which  in 
essence  requires  that  L (T)  behave  nearly  linearly  in  the 
vicinity  of  a maxima.  The  rate  or  convergence  for  problems 
that  do  successfully  terminate  is  quadratic  above  the  noise 
level  of  machine  calculations  and  it  follows  rising  ridges 
well.  However,  this  second-order  scheme  is  renowned  for  its 
propensities  to  seek  saddle  points  and  follow  ridges  out  of 
the  vicinity  of  the  feasible  region.  Also,  computing  H can 
be  prohibitively  expensive  and  imprecise  for  L (T)  , 

2 

requiring,  as  it  commonly  does,  k very  extensive  n-sums  of 

complicated  nonlinear  transcendental  terms.  (Not  to  speak 
of  the  debugging  effort  in  checking  program  logic  and 
algenra.)  Goldstein  and  Price[103],  have  suggested 
approximation  of  H by  finite  differences  on  L (T)  in  these 
cases.  trior  analysis  of  the  Newton-Raphson  scheme  is  given 
by  Lancaster[  148 ]. 


Many  methods  have  been  proposed  to  give  convergence 
rates  like  those  of  Newton-Raphson  and  dependability  of 
steepest  ascent.  Usually  these  involve  forming  an  iteration 
matrix,  H,  by  various  means  in  tne  interests  of  assuring 
positive  definiteness  over  the  largest  neighborhood. 


The  conjugate  gradient 
Stief elf  126 ],  applies  an 


method,  invented  by  Hestenes  and 
ingenious  one-step  memory  by 
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modifying  the  steepest  ascent  iteration  to  the  recursion 


s.=  7L(T.)  ♦ <II7MT.)II/II?L(T.  )ll)  S.  , 

1 1 1 1-1  1-1 

with 

S^=  7L(T  )/|  I7L(T  ) I I • 

This  scheme  avoids  the  notorious  hem-stitch  stalling  of  the 
steepest  ascent  method,  even  permitting  finite  convergence 
proofs  for  quadratic  objective  functions.  The 

ortnogonalized  gradient  vectors,  and  the  conjugacy  and 
linear  independence  of  the  steps  is  achieved  at  very  little 
cost,  without  requiring  maintenance  of  second  order 

information,  such  as  H (T)  . Thus,  second  order  convergence 
can  often  be  achieved  at  very  little  additional 

computational  cost.  The  method  was  suggested  for  solving 
linear  systems  by  Hestenes  and  Stiefel[126]  and  implemented 
for  nonlinear  objective  functions  later  by  Fletcher  and 
Reeves(90].  A complete  development  is  given  oy 

Hestenos[  124, 125  ].  A convergence  discussion  and 

modifications  to  the  method  are  given  by  Daniel[60]. 

Fis.ner[8b]  gives  the  second-order  method  of  scoring, 
also  discussed  by  Rao[197],  which  is  specific  to  problems  in 
whicn  a log-likelihood  function  is  maximized,  and  is 
identical  to  the  Newton* Ra phson  method,  except  that  the 

Hessian  is  replaced  by  its  expectation, 

fl  = E[-H(T)  ] , 

where  H is  called  the  information  matrix,  which  Kendall  and 
Stuartf 14 1,  p . 56 ] show  to  always  be  positive  definite.  We 
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see  that  the  final  iteration  matrix  for  this  schem 
is  ths  Cramer-Rao  bound  for  regular  M.L.E.  Van 
Chowdhur y£ 223 ] give  some  computational  examples  a 
some  miner  modifications  for  this  approach.  Tn 
reguiles  a formal  derivation  of  the  expectation  of 
complicated  transcendental  sums  in  the  Hessian  ma 
example  will  serve  to  illustrate  the  scope  of  th 
later. 


A -1 

e,  B(T)  , 

daele  ana 
nd  suggest 
is  metnod 
some  very 
trix.  An 
is  problem 


Both  the  theoretical  and  numerical  pertormance  of  these 
iteration  methods  can  be  improved  by  appropriate  affine 
transformation  of  the  problem.  Kor  instance,  see  the  recent 
in  vest  iqation  of  Arnorfi].  Other  techniques  can  be  applied 
to  insure  positive  definiteness  tor  M.  Various  spectral 
decompositions  of  H may  oe  used.  Determination  of 
eigenvectors  ana  associated  eigenvalues  of  the  real 
symmetric  matrix  H is  possible  by  several  methods  reviewed 
Dy  Schwarz,  Putishauser  and  St  ief elf  209  ],  along  with  S'juai^ 
root  and  Cholesky  decompositions.  Although  diagonalization 
and  orthonormalization  of  H will  eliminate  local  parameter 
interaction,  the  neighborhood  over  which  the  result  holds  is 
quite  small  for  non-quadratic  problems,  making  the 
transfer  station  of  questionable  value  when  performed  at  the 
high  expense  of  the  eigen-analysis.  If  the  condition  number 
of  H is  defined  as  the  ratio  of  the  absolute  values  of  the 
largest  to  the  smallest  eigenvalues,  then  a measure  results 
of  botn  topological  distortion  from  an  idealized 
k-diaiensional  response  sphere  about  T,  and  the  difficulty 
with  wnich  M will  be  accurately  invertedf  1 27,76, 133  ]. 

Advocates  of  the  transformational  approach  nave  even 
proposed  introducing  constraints  on  the  eigenvalues  of  H, 
for  instance,  replacing  negative  eigenvalues  by  tneir 
absolute  values,  and  near-zero  values  by  a small  constant 
was  proposed  by  Greenstadtf  108 ] for  maximization  with  a 
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Newton-Raphson-like  scheme.  With  some  difficulty  we  can 
momentarily  visualize  the  presence  of  a large  condition 
number  implying  the  existence  of  a long  ridge  or  trough 
oriented  with  the  eigenvector  associated  with  the  eigenvalue 
in  the  denominator.  This  is  a good  situation  for  a 
second-order  iteration  scheme  if  the  ridge  is  convex,  wnich 
is  the  case  when  the  eigenvalue  in  the  denominator  of  the 
condition  number  is  positive.  This  eigenvalue  constraint 
method,  and  other  similar  proposals,  attempt  to  mask  the 
concave  ridges  and  saddle  points  which  are  also  attractive 
in  the  second-order  iteration.  Booth  and  Peterson[26  1 
discuss  such  geometric  inference  at  length. 

A reasonable  compromise  is  the  simple  scaling  of  B, 
analgous  to  the  creation  of  a correlation  matrix  from  a 
covariance  matrix.  Let  a scaling  of  M be  performed  by 


M * {■.  ./I® . .■  I } » 
s 13  11  jj 


with  singularities  m =0  replaced  in  the  computation  by  1. 

jj 

This  can  ease  the  burden  of  computing  spectral 

decompositions  for  the  iteration  matrix,  and  it  can  reduce 
internal  loss  of  numerical  precision  in  the  iteration 
scheme. 


In  the  same  vein,  a normalized  gradient  is  sometimes 
applied 


7L  (T)  = 7L  (T)  / 1 !7L(T)  | | , 


to  keep  computations  numerically  stable  and  place  the 
scaiinq  burden  on  the  scalar  stepsize,  a.  Even  though  these 
transformation  methods  are  always  available  and  sometimes 
useful,  they  are  not  emphasized  in  this  presentation  for 
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simplicity . This  is  appropriate  in  part  since  the 
investigator  should  always  take  care  to  reasonably  scale  any 
problem  regardless  of  the  method  employed  to  solve  it. 

Levenburg[  1 53 ] proposes  a scheme  which  has  since  been 
generalized  and  machine  implemented  by  Marguardt£ 164  ].  In 
the  development,  a method  is  sought  which  will  behave  lixe 
steepest  ascent  in  regions  not  local  to  the  solution,  and 
like  Newton-Raphson  when  the  solution  is  approached.  The 
iteration  matrix  is  chosen 

H = -H  (T)  ♦ ml  , 

with  m a positive  constant.  Ke  see  that  no  matter  how 
ill-conditioned  H is,  a suitably  large  choice  for  m will 
give  a numerically  nonsingular  iteration  matrix. 

(The  nonsingularity  of  H is  more  apparent  1 i we 
momentarily  consider  the  convex  comnination 

H « -(l-a)H(T)  ♦ al,  0<a<1  . ) 

For  m=0,  this  Marguardt- Levenbur g heuristic  is  the 
Newton-Raphscn  oetnod,  and  for  m large  this  approaches  the 
steepest  ascent  method.  Marguardt  gives  a heuristic  for 
modifying  m by  a multiplicative  expansion/reduction  factor 
on  the  basis  of  algorithm  performance.  A more  formal  method 
of  determining  o was  later  put  forth  by  Smith  and 
Shanno[212],  along  with  facility  for  handling  linear 
constraints  by  the  projected  gradient  method  of  Rosen[203  ]. 

Marguardt  also  introduces  a useful  termination 
criterion  for  tolerance  of  resolution.  With  "|...|" 
denoting  a k-vector  of  absolute  values,  this  is 
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-b  -3 

|T  -T  | < 10  (I  ♦ 10  ) . 

a m-1  m-1 


This  might  be  restated 


-d  -n 

|T  -T  | < 10  (T  ♦ 10  ) , (d  + n)  In  10  < b, 

m ra- 1 m-1  2 


with  d the  number  of  significant  digits  of  desired 
resolution.  b is  the  number  of  bits  in  the  floating  point 
mantissa  of  the  computer  used,  modified  by  the  noise  level 
for  one  or  two's  complement  arithmetic. 


Another  school  of  thought  attempts  to  achieve 

second-order  convergence  without  evaluating  H at  each  step 
of  the  iteration.  The  iteration  matrix,  H,  is  assiduously, 
and  hopefully,  maintained  as  a negative  definite  substitute 

-1 

for  H . Such  variable  metric  methods,  introduced  by 


DavidonT63],  and  discussed  by  Broyden[35],  are  in  reality 
more  computationally  efficient  indirect  ways  of 
approximating  the  Hessian  matrix  by  dif rer enci ng  as 
suggested  earlier  by  Golstein  and  Price[103].  These 
approaches  work  by  adding  a correction  matrix  at  each  step 


- 1 - 1 

N = H ♦ C , 

“i  i-1 


with  C derived  in  several  ways.  Define 


-1  -1 

AT  = T - T = ah  s = afl  VL  (T)  , 

i i-1 


4(7L(T)  ) = 7L(T  ) - 71  IT.  ) , 
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AS  = A(7L(T)) 


I 


and 


-1 

d = AT  - M As  , 
i-1 


-1 

then  a rank-one  correction  for  the  iteration  matrix,  H , is 

C = dd'/AT'd  ; 

there  aie  others,  for  instance  see  Householder[  130,  p.  123  ]. 

A rank-two  correction  for  the  iteration  matrix, 
developed  by  Davidon,  and  Fletcher  and  Povell[89],  gives 

-1  -1  -1 

C = ATAT'/AT'As  - H AsAs*a  /As'B  As  . 

“i-1  “i-1  “i-1 

An  inverse  rank-one  correction  proposed  by  Powell[191]  and 
Bard[  12  ] uses 


-1 

c * As  - H AT  , 
i-1 


to  give 


C = cc’/AT'c 

Povell[191]  suggests  using 


-1  -1 

M = M + C , 
“i  “i-1 


while  flard  suggests 
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These  rank-one  methods  have  also  ueen  discussed  by 
Greenstadt[  109  1,  Fiacco  and  McCormick[  8 3,  p . 170  ], 

Cantrell[  43  ],  Miele  and  Cantrellf  1 68  ],  Cragg  and  Levy[57], 

For sy the[ 92  ] , Myers[17QJ,  and  many  others,  largely  with  the 
objective  of  finding  a stepsize  with  minimal  expenditure  and 
avoidinq  singularities  in  H.  Lill[155]  presents  a computer 
program  with  some  of  these  features.  Rank- two  and  other 
variable  metric  schemes  have  been  examined  by  Bardfll], 

DavidonT64],  Goldfarbf  99  ],  Matthews  and  Davies[1b5],  Brown 
and  Dennis[33],  and  Broyden£  36 ,229,230  ],  who  gives  evidence 
against  using  transformations  on  the  problem  when  in  a near 
neighborhood  to  the  solution  under  pain  of  st ailing  the 
algorithm.  On  the  other  hand,  Oren  and  Luenbergerf  1 7 8,  1 79  ] 
propose  a self-scaling  variable  metric  class  of  algorithms 
with  claims  of  excellent  performance. 

t 

i 

These  methods  have  been  compared  with  others  intended 
for  the  mere  general  problem  of  solving  a simultaneous  set 
of  nonlinear  equations  by  Barnes[13],  Daniel[6  1],  and 
Broydenf  34, 39  ].  For.  contrast,  it  is  also  instructive  to 
review  earlier  work  by  Davidenkof  62  ],  and  Wolfe[235  ]. 

I 

1 

I 

A further  modification  of  seccnd-order  schemes  is 
introduced  in  two  excellent  papers  by  Stewartf 218 ],  and  Gill 
and  Murray[97],  in  which  the  gradient  is  estimated  by 
differences,  and  sequential  approximations  of  the  Hessian 
are  made  with  great  care  in  an  attempt  to  balance  truncation 

t 

errors,  loss  of  numerical  precision,  and  ill-conditioning  in 

| 

the  iteration  matrix.  These  authors  mention  the  numerical 
singularities  that  can  occur  in  the  iteration  matrix  despite 
theoretical  guarantees  to  the  contrary.  Gill  and  Murray 
propose  the  spectral  decomposition  known  as  Cholesky 
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factorization  for  representing  the  symmetric  Hessian.  For  L 
a lower  triangular  matr^.**  and  D a diagonal  matrix,  the 
factorization  produces 

H = LDJj*  . 

Definiteness  for  H is  then  assured  by  the  careful  monitoring 
of  diagonal  elements  of  L and  D. 

Jones[1J6]  gives  a factorization  for  Karguardt's 
scheme.  Jones,  Ross[207]  and  Bard[12],  give  comparisons  of 
the  various  indirect  iteration  schemes,  finding  the 
Marguardt  and  Davidon-Fletcher-Powell  methods  better  in  most 
test  problems.  Brooks[  30 ] gives  a review  of  ealier 

■ 

unconstrained  methods,  as  do  Dennis[71],  Powell[192], 
Spang[215]  and  Kowalik  and  Osbor ne[  144  ]. 


jb  METHODJ  WITH  CONSTRAINTS 


General  constraints  on  the  optimization  problem  have 

already  been  defined  notationally  along  with 

characterizations  of  optima  under  these  conditions. 

Algorithms  permitting  constraints  are  classifiable  by  the 

adoissable  form  of  the  constraints  and  tne  associated 

objective  function.  For  instance,  a linear  constraint  set 

can  be  treated  with  classical  linear  programming,  L.P., 

methods  if  the  objective  function  is  approximated  linearly. 

Note  that  the  L.P.  includes  mechanisms  for  the 

determination  of  interior  points,  T , given  any  starting 

i 

value  for  T^.  Frank  and  Wolfe[93]  present  such  a 
first-order  algorithm,  for  linearly  approximated  objective 
functions,  stated  for  step  i: 

MAX  VL(T  )'T  , 

T i- 1 i 

which  is  solved  via  a standard  L.P.  step  (treating  \7L(T  ) 

i-  1 

as  a fixed  parameter  vector) , reapproximated,  and  so  forth. 

Other  similar  approaches  to  the  problem  have  been  proposed 
by  Wolfe[236]  who  uses  tne  Kuhn-Tucker  conditions  to 
formulate  a L.P.  for  a quadratic  objective  function,  while 
Beale[16,17]  and  Zangwill[  240  ] imbed  the  objective  function 
evaluation  within  the  L.P.  mechanism.  Non-convex  problems 
have  Deen  approached  similarly  with  decomposition  techniques 
discussed  by  Zangwill[  242  ].  A primal-dual  method  is  given 
by  van  de  Panne  and  Whinston[  222  ]. 

Nonlinear  equality  constraints  may  be  implicitly 
combined  with  the  objective  function  by  the  use  of  Lagrange 
multipliers,  as  discussed  earlier,  to  produce  an 
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For 


? 


unconstrained  equivalent  maximization  problem, 
nonlinear  inequality  constraints,  a set  of  active,  or  basic, 
constraints  is  kept  in  the  objective  function  and  used  via 
the  current  multipliers,  or  their  estimates,  to  give 
feasible  directions  for  eacn  iteration,  eitner  alonq  an 
active  constraint,  or  toward  the  interior  region.  These 
modifications  are  discussed  for  gradient  methods  and 
quadratic  objective  functions  by  Markowitz^  163],  Theil  and 
van  de  Panne[220  ],  and  Lemke[152].  General  problems  with 
mixed  nonlinear  constraints  are  examined  by  Rosen[  203,204  ], 
Davies[65],  Zoutendi  jk[  244  ],  Forsy  the[  9 1 ],  Goldstein[ 100  ], 
and  many  others.  Goldfarb[98]  gives  a generalization  of  the 
Davidon  - Fletcher-Powell  second  order  method  to  accomodate 
mixed  linear  constraints.  Greenstadt[ 107  ] presents  a local 
deflected  gradient  method. 

Nonlinear  constraints  may  also  be  explicitly  added  to 
the  objective  function  by  the  use  of  penalty  functions,  an 
idea  attributed  by  some  to  Courant[56],  recently  suggested 
by  Cartoll[44]  and  generalized  by  Fiacco  and 
McCor mick[ 82 ,8 3, 81  ].  For  example: 

MAX  L (T) 

T 

s.t.  9 (T)  < 0 , g^  (T)  = 0 , 


is  restated  with  "interior”  penalty  functions 

1/2 

MAX  L (T)  ♦ C/g'  (T)  1 - g'  (T)g^  (T)/c 

T 1 2 2 

with  c a scaling  parameter,  and  1 a summing  vector.  As  an 
interior  point  approaches  any  constraint,  the  objective 
function  is  distorted.  This  sequential  unconstrained 
optimization  technique,  S.U.M.T.,  solves  a sequence  of 
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monotonically  less  internally  distorted  problems  by 
decreasing  c to  a noise  level.  We  see  that  a formal  basis 
of  active  constraints  need  not  be  maintained,  although  logic 
should  be  included  to  permit  numerical  evaluation  of  the 
ratios  in  the  objective  function  as  tney  approach 

indeterminate  limits.  Sequential  relaxation  of  the 
penalties  will  ultimately  terminate  with  an  interior 

solution,  or  for  problems  with  active  constraints  in  their 
final  solution,  a termination  occurs  in  a close  neighborhood 
of  the  undistorted  solution.  Great  care  must  be  taken  in 
constructing  the  S.U.M.T.  iteration  so  as  to  properly  scale, 
or  "tune,”  the  constant,  c. 

Zangwill[  24 1 ] gives  an  "exterior"  penalty  function 
formulation 

MAX  L (T)  - eg*  (T)  g (T)  , 

T 

with  g(T)  the  subset  of  constraints  from  g ^ (T)  and  g^CT) 

violated  by  the  current  solution,  and  c a positive  constant 

sequentially  increased  maximization-to-maximization  to  an 
arbitrarily  large  terminal  value.  While  this  method  admits 
any  starting  solution,  , there  is  an  added  burden  of 

maintaining  a current  index  set  for  violated  constraints. 


Many  ether  variations  have  been  proposed  for  penalty 
methods,  notably  by  Camp[42],  Butler  and  Martin[41], 
Goldstein  and  Kripke[102],  Stong[219],  Pomentale[  1 89  ], 
Fiacco(79],  Fiacco  and  Jones[80],  Kowalik,  Osborne  and 
Ryan[  145  ],  and  Beltrami  and  McGill^18]. 

Finally,  cutting  plane  algorithms  introduced  by 
Keiley[140]  for  a linear  objective  function  and  nonlinear 
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constraints,  and  by  Cheney  and  Goldstein[ 47  ] and  Wolfe[237  ] 
for  strictly  concave  objective  function  and  constraints,  and 
a constraint  set  which  is  convex,  involve  successive 
introduction  of  auxilliary  variables  and  constraints  to  a 
sequence  or  linearly  bounded  problems.  Such  strategies  can 
lead  to  cumbersome  dimensionality  and  numerical  overhead 
even  for  relatively  small  problems. 

The  texts  by  Hadley[111],  Fletcher(88]  and 
Mangasar ian[ 158  ],  give  extensive  development  of  the  various 
constrained  algorithms. 


F_.  5UMMARYJ.  AN  EFFICIENT  GENESAL  TECHNIQUE 


Convergence  proofs  are  widely  published  for  most  of  the 
numerical  optimization  methods  presented  thusfar.  For 
instance,  Zangwill[  243  ] develops  several  representative 
theorems,  each  with  its  set  of  simplifying  assumptions  and 
necessary  conditions.  However,  even  for  a "nice"  problem 
(convex,  quadratic,  and  so  forth)  these  mathematical 
demonstrations  all  implicitly  depend  at  some  point  upon 
exact  arithmetic,  and  are  tnus  weakened  by  finite  numerical 
precision  of  floating  point  operations  on  a digital 
computer.  As  an  example,  the  effect  of  numerical,  or 
random,  perturbations  on  an  iteration  matrix  and  thus  its 
inverse  is  largely  a mathematical  problem  that  is  not  well 
understood.  Perhaps  one  is  better  off  to  adopt  a passive 
view.  An  undesirable,  but  nontheless  terminal  state  of  an 
iteration  algorithm  is  always  possible  due  to  mathematical 
and  numerical  instabilities.  This  is  the  motivation  of  the 
"terminal  state"  approach  taken  here,  rather  than  a 
"convergence"  point  of  view. 

The  relative  computational  success  of  an  algorithm  in 
practice  often  becomes  a more  important  criterion  for  its 
selection  than  theoretical  rate-of-con vergence.  Further, 
one  must  usually  trade  off  the  degree  of  automation  of  a 
method  (the  amount  of  monitoring  and  "tinkering"  reguired 
for  each  application)  with  efficiency  stated  either  in  terms 
of  solution  expense  or  the  probability  of  termination  at  a 
stationary  point  that  is  optimal.  In  short,  sufficient 
proof  is  performance,  and  it  is  never  general. 

Along  these  lines,  it  can  be  dangerous  to  attempt  to 
generalize  the  results  of  computational  experiments  on 
"standard"  functions,  such  as  those  discussed  by  Rosen  and 
Suzuki[  205  ],  to  a complicat e«.’  application  (very  nonlinear. 
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high  dimensionality,  etc.)-  One  oi  the  reasons  for  the  lack 
of  literature  on  comparisons  of  algorithm  performance  on 
real  problems  is  the  incredibly  high  cost  of  conducting  such 
competitions,  measured  in  exhausted  computer  accounts  and 
man-hours  expended  in  preparation.  Other  factors 
con  triouting  to  the  paucity  of  published  comparisons  include 
the  sheer  volume  of  data  reguired  to  display  t.he  results 
meaningfully,  and  the  proprietary  nature  of  both  the 
problems  and  implemented  algorithms. 

There  are  some  refreshing  exceptions,  such  as  the 
report  of  Friedman  and  Pinder[94]  wno  state  that  the  complex 
method  performed  better  for  their  application  than  S.U.M.T., 
deflected  gradients,  or  pattern  search. 

Graves[104]  gives  a description  of  a general  nonlinear 
programming  algorithm  developed  and  used  for  several  test 
cases  and  applied  to  a moi.e  complicated  minimum  fuel 
guidance  problem.  Graves  and  Whinston[  1 06  ] present  both 
analytic  evidence,  and  experimental  results  for  convergence 
of  the  method  on  a set  of  problems  given  by  Colville[ 54 ]. 
Hatfield  and  Graves[122]  give  another  favorable  application, 
and  Clasen,  Graves  and  Lu[49]  describe  a set  of  large 
munitions  mix  allocation  problems  upon  which  the  method  is 
successfully  applied.  Hense,  the  efficacy  of  the  Graves 
algorithm  has  been  established  on  noth  theoretical  and 
empirical  grounds  over  a period  of  ten  years'  use. 

Tne  method  is  based  on  the  optimization  of  a sequence 
of  local  linear  programming  problems  arising  from  the  first 
order  approximation  of  the  objective  function  and 
constraints  in  the  neighborhood  of  a current  solution, 
bigniticantly , the  program  has  general  provisions  for  the 
practical  implications  of  certain  numerical  characteristics 
of  finite  precision  digital  computers  and  for  vagaries  in 
the  behavior  of  general  nonlinear  constraint  sets. 
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The  efficient  performance  of  the  method  is  probably 
most  directly  attributable  to  the  speed,  precision  and 
compactness  of  the  linear  programming  algorithm  exercised; 
the  mutual  primal-dual  representation  of  linear  programs 
given  oy  Balinski  and  Gomory[10]  and  extended  and  completely 
implemented  by  Graves[105]  provides  both  impressive  solution 
speed  and  a format  within  which  the  local  linear  problem  may 
be  easily  and  effectively  manipulated  to  deal  with  various 
inconsistencies  and  parameterizations  which  arise  during  the 
course  of  solution  of  a nonlinear  problem. 

Linearly  constrained  problems  are  handled  expeditiously 
by  this  general  nonlinear  method.  Modifications  of  the 
iteration  mechanism  are  facilitated  by  program  features 
permitting,  for  instance,  incorporation  of  a second  order 
representation  of  the  proolem  given  in  £106].  Externally 
supplied  routines  may  be  used  to  monitor  the  progression  of 
solutions,  provide  starting  tableaus  from  previous 
solutions,  perform  specialized  input/output  functions,  and 
so  forth.  As  further  evidence  of  the  adaptability  of  the 
Graves  philosophy  in  nonlinear  programming.  Hat f ieldf 1 2 1 ] 
demonstrates  an  efficient  conjugate  gradient  method  for  a 
linearly  constrained  nonlinear  programming  problem. 

The  general  method  uses  the  fundamental  linear 
approximation  theorem  presented  and  proved  by  3uck[ 40, p. 180 ] 
for  continuous  differentiable  functions,  g (T) . If,  for  some 
and  single  constraint,  g(T), 

g (T  ♦ AT)  = g(T  ) + Vg  (T  ) 'AT  ♦ rem(T  ,AT)  , 

0 0 0 0 


then 


LIM  rem  (T  ,AT)/|AT|  = 0 , 
AT-»0  0 
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The  iteration 


is  approached  uniformly  for  feasible  T^. 
proceeds  by  solution  of  the  local  linear  programming  problea, 


SAX  VL  (T  ) »AT  , 
AT  0 


s.t.  PG(To)«AT  < -g(TQ)  - Ter  , 


with  r a vector  of  positive  constants  representing  the 
directional  linear  approximation  error  estimated  from  the 


most  recent  iteration , and  initialized  r=0. 


is 


parametric  adjustment  constant,  used  to  control  the  rate  of 
solution  of  the  local  problems. 

Three  numerical  bounds  are  imposed  on  the  algorithm. 
The  first  is  an  upper  bound  on  the  variables  of  the  dual 
program 


MIN  Y'[g(T  ) + Kr]  , 

Y 0 

s.t.  Y*VG(T  ) = 7L(T  ) , 


0 

Y > 0 


0 


which  is  stated 


B > Y»  1 . 
1 


This  condition  insures  that  the  optimal  bases  of  the 
sequence  of  local  linear  programs  do  not  approach 

singularity  arbitrarily  closely  while  remaining  nonsingular, 
ana  is  used  in  lieu  of  the  Kuhn-Tucker  constraint 

qualifications. 
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so  that  numerical  range  constraints  on  T may  be  incorporated 

i algebraically  into  solutions  without  inclusion  in  the 

<'  constraint  set,  g (T)  , and  to  preclude  local  numerically 

unbounded  solutions. 


The  last  bound, 

b £ MAX  {|RE«  (Tq,AT)  |}  , 

gives  an  upper  limit  for  the  linear  approximation  remainder 
terms.  This  error  bound  is  used  with  during  the  progress 

of  the  algorithm  to  control  the  parametric  adjustment  for 

infeasibilities  in  local  linear  programs  via  the  constant  k. 

A zero  level,  e,  is  also  provided  as  a "noise"  limit 
for  numerical  computations  within  the  program.  This  is  a 
very  important  feature  in  several  ways.  For  instance,  the 

; •<*  i ■ 

pivotal  transformations  use  e to  control  accumulation  of 
truncation  errors.  Most  important,  constraints  are 
considered  to  be  satisfied  when 

f g (T)  < 0 ♦ e . 

t' 

[■  This  is  a subtle  feature.  Some  thought  about  numerical 

evaluation  of  nonlinear  functions  bounding  the  feasible 
region  reveals  that  apparent  inconsistencies  caused  by  loss 
of  real  precision  could  lead  to  incorrectly  concluding  that 
an  in  feasibility  has  been  encountered,  when  in  fact  T is  in 
a feasible  e-neighborhood  of,  for  instance,  an  equality 
constraint.  Remember,  too,  that  the  local  linear  program 
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r l wilL,  when  finally  applied  to  uiaximiz  in  g the  objective 

f j 

function,  seek  basic  extremal  solutions  on  the  boundary  of 
in-  quality  constraints  as  well.  Thus  this  e-relaxation  is  a 
; f uiidainen tal  technique.  In  the  lexicon  of  Iversonf  1 34  ],  we 

■ * treat  constraint  boundaries  as  being  "fuzzy." 

| j 

A sequence  of  consistent  local  linear  programming 
t problems  is  solved  by  constructively  treating  violated 

linear  approximations  of  constraints  as  objective  functions 

' i 

* in  subprublems.  If  for  some  intermediate  solution  Tc  has 

r been  par .imetr ica lly  reduced  to 


IC  = e/[  (3  + 1)  ] , 


and  tnere  still  remains  a violated  constraint  g (T  ) , 

0 

* 

member  of  active  constraints  g(T  ) with  associated 

0 

* 

variables  Y,  such  that 


not  a 
dual 


* * 

; y*g(T0)  * -g(T)  - e , 

i then  an  unr ^solvable  inconsistency  is  reached  as  a terminal 

state. 

A terminal  optimal  solution  is  recognized  when  a local 
j linear  program  exhibits  a dual  solution  with 

I* g (T  ) > -e  . 

0 

A finite  convergence  proof  for  this  technique  requires. 


K 


5b 
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as  always,  restricting  assumptions  about  the  nonlinear 
functions  in  the  problem.  However,  a terminal  optimal 
solution  to  a local  linear  program  is  a stationary  point  for 
the  oriqinal  objective  function,  L(X,T).  The  possibility  of 
termination  at  a stationary  saddle  point  cannot  be  ruled 
out,  but  experience  has  shown  such  a result  to  be  very  rare 
for  real  problems. 


A second  order  representation  of  the  primal  problem  can 
often  be  expected  to  converge  more  quickly  in  the 
neighoorhcod  of  a stationary  point  than  the  first  order 
"gradient"  formulation.  To  achieve  the  higher  order 
representation,  we  create  an  expanded  nonlinear  program  by 
introducing  the  first  order  stationary  conditions  as 
constraints.  This  expanded  representation  introduces  the 
dual  variables  explicitly  and  uses 


T = {T  ,T] 


so  that  the  reformulation  yields 


* * 

MAX  [?L(T)  - VG(T)'T]'T  + g(T)'T 
T* 


s.  t . 


g(T)  < 0 , 


?L  (T)  - VG  (T)  'T  < 0 , 


T > 0 . 


We  define  H (T)  as  the  three  dimensional  matrix  of  Hessians 
for  the  constraint  set,  with  "column  j"  the  Hessian  of 


gj  (T) 


# * 

Now,  with  T = T^,  the  parameterized  local  linear 


program  becomes 
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# * # 

MAX[7L  (T^)  + (B(T^)-H(T^)T^}T^  j'AT  ♦ [ g (Tq)  * 7fi  (Tq)  Tq  ]'AT 


s.  t.  7G  (T  )AT  < -g  (T  ) - Kr  , 

“0  0 1 

* * * * 

[H(T  ) - H (Tq)  Tq  ]AT  - 7C(Tq)'AT  < -7L(Tq)  + - X 


In  the  special  case  of  unconstrained  problems,  the 
local  linear  program  becomes 

MAX  [7L(T^)  + H (T^)  ]'AT  , 

s.t.  H (T  ) AT  < -7L(T  ) - ler  , 

"0  0 2 

and  the  direction  of  ascent,  neglecting  the  parameterization 
term  Kr  , becomes 


-1 

T = -[H(T  ) ]*7L  (T  } , 

L-  0 J 0 

which  is  the  familiar  Newton-Raphson  result  when  the  Hessian 
is  cf  full  rank. 

It  should  be  noted  that  the  present  linear  programming 
approach  is  more  robust  than  the  classical  Newton-Raphson 
process  precisely  because  it  will  continue  to  function  in 
the  presence  of  a singular  or  near  singular  Hessian.  Also, 
tne  relaxation  features  of  the  method  introduced  via  the 

parameterization  of  the  right  hand  side  with  Kr  generally 

have  a salutary  effect  on  the  rate  of  convergence  for  a 
constrained  first  order  or  any  second  order  representation 
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of  the  problem 


CHAPTER  III 


ESTIMATION  FOB  THE  WEI BOLL  DENSITY  FUNCTION 


A_.  introduction  to  the  parametric  wei bull  family 

A density  function  has  been  proposed  for  describing 
breaking  strength  in  materials  and  later  formally  introduced 
by  Waloddi  Weibull  [ 227,228,229]  for  use  in  fitting  many 
types  of  physical  data  from  various  academic  and  industrial 
fields  of  interest.  It  is  his  reasonable  claim  that  there  is 
very  seldom  sound  theoretical  oasis  for  applying  any 
particular  density  to  real  data.  He  therefore  advises 
choice  of  a relatively  simple  density  function  which  seems 
to  fit  with  empirical  observations,  and  "stick  to  it  as  long 
as  none  better  has  been  found[  229, p.  293  ]. " 

Tne  Weibull  density  was  originally  parameterized 


T'  = { a,  b,  c } , 


and  given  the  form 


b-1  b 

f (x#T)  = (b/a)  (x  - c)  exp(-[(x  - c)  /a]}  ; 


a,  b > 0;  x > c . 


A reparameterization  gives  the  equivalent 


b-1  b 

f <x,T)  = (b/a)  [ (x  - c)/a  ] exp  {-[  (x  - c)/a  ] } ; 


a,  b > 0;  x > c . 
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parameter,  a.  The  chameleonic  nature  of  this  family  is 
discussed  by  Lenma n[  1 5 1 ].  Its  robust  adaptability  for  data 
fitting  have  made  it  a popular  candidate  in  such 
applications.  Indeed,  with  b= 1 the  Weibull  simplifies  to 
the  two  parameter  exponential  family,  and  when  b=2  the 
Rayleigh  family  results.  Figure  2 shows  the  Rayleigh  family 
of  densities  arising  from  the  Weibull  with  b=2,  arbitrary 
location  parameter,  c,  and  several  values  of  the  scale 
parameter,  a. 

By  design,  the  Weibuil  density  is  a perfect  algebraic 
differential,  and  its  reliability  function  is  defined 

T b 

R (x,T)  = / f (x,T)  dx  = exp  {-[  (x  - c) /a  ] } , 

2 x 2 

and  the  distribution  function  follows: 

x b 

F (i:,T)  = < f (x,T)  dx  = 1 - exp  (-[  (x  - c) /a  ] } . 

2 c 2 

Keen  interest  in  the  Weibull  family  comes  from 
reliability  applications  and  the  statistics  of  extremes. 
Gumbelf  1 10  ] gives  a derivation  of  a form  of  the  Weibull 
family  under  the  name  "Type  III  asymptotic  distribution  of 
the  smallest  extreme."  Also,  reliability  theory  leans 
heavily  upon  the  concept  of  "hazard  rate,"  which  is  defined 
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f 


h(x,T)  = f (X,T)  / R (x , T)  . 

2 2 

This  is  interpreted  as  the  instantaneous  failure  rate  of  a 
functioning  electronic  device  or  physical  component  under 
service  stress. 

Many  statistical  studies  are  made  under  hypothetical 
conditions  of  decreasing,  constant,  or  increasing  hazard 
rate.  The  flexible  Weibull  family  can  exhibit  all  three. 
In  fact,  another  derivation  of  the  Weibull  density  comes 
immediately  from  the  assumption  of 


b-1 

h (x,  T)  = (b/a)  [ (x  - c)  /a  ] , 

as  the  mathematical  form  for  the  hazard  rate  function. 

In  an  excellent  introduction  to  reliability  theory, 
Mann,  Schafer  and  Singpurvalla  give  many  references  to 
applications  in  the  open  literature  using  the  Weibull,  and 
state:  "Recently,  the  Weibull  distribution  has  emerged  as 

the  most  popular  parametric  family  of  failure 
distributions."  [161, p. 127] 
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B_.  ESTIMATION  ALTERNATIVES 

The  most  popular  statistical  techniques  for  parametric 
estimation  in  the  Weibull  family  are  graphical  estimation, 
the  method  of  moments,  and  M.L.E.  The  first  method  needs 
little  discussion,  n owever  the  latter  two  demand  some 
mathematical  development  for  proper  evaluation,  especially 
for  the  M.L.E.  For  the  present  we  will  restrict  our 
attention  to  complete  random  samples,  and  the  most  general 
case  of  all  three  parameters  unknown.  The  literature  is 
rife  with  examples  of  estimation  for  subsets  of  the 
parameters,  although  numerical  details  are  often  scarce. 
The  most  prolific  author  on  the  subject  is 

Mann[  160 ; 16 1,p. 185ff . ] who  gives  extensive  references. 
Dubey[  75,73,72,76  J has  also  made  many  contributions.  Also 
see  the  cases  given  by  Menon[166]  and  Smith  and  Dubey[213], 
Generalizations  of  specia’  cases  for  subsets  of  Weibull 
parameters  have  been  given  by  Dubey[77]  and  others.  An 
excellent  discussion  of  the  entire  topic  is  given  by 


Rockette[  201]. 


Graphical  estimation,  used  by  Weibull[229]  and 
described  by  Ber rettoni[  22  ] and  Kao[139]  relies  on  some 
prior  knowledge  of  parameter  values  and  the  tact  that  the 
reliability  function,  in  the  for*  proposed  by  Weibull, 

f 

b 

8 (X/T)  = exp  {-[  (x  - c)  /a}  , 

can  be  transformed  to 

In  {— 1 n[  R ^ (x,T)  ]}  = b In  (x  - c)  - in(a)  . 

1 A value  for  the  location  parameter,  c,  is  asserted  with 
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reference  to  the  first  order  statis.ic  in  .•  sample.  Then 
the  empirical  reliability  function  is  plot.'.;  i <n  a "let-log" 
ordinate  scale  versus  a "log"  abcissa  oi  the  unplaced 
sample  values,  X - c.  If  the  resulting  points  fall  in  a 
nearly  straight  line,  then  b is  estimated  as  its  slope  anc  a 
is  found  from  the  intercept,  In  a.  Otnerwise,  another  valu» 
of  c is  tried,  and  so  forth. 

Obviously,  this  subjective  estimation  method  leaves 
much  to  be  desired  statistically.  However,  it  can  be 
carried  out  with  tools  no  more  formidable  than  an  extensive 
table  of  logarithms,  and  it  has  served  adequately  for 
decades.  Of  course,  a L.S.  approach  to  this  transformed 
problem  is  also  possible,  but  the  results  arc  statistically 
comparable  to  the  manual  method. 

The  method  of  moments  can  be  used  to  estimate  Weibull 
parameters.  Although  a moment  generating  function  tor  the 
Weibull  cannot  be  given  in  closed  form  algebraically,  the 
central  moments  are  defined 


q «•  q 

m 1 = E[  X ] - / X f (x,T)  dx  , 

q c 2 


th 

from  which  we  derive  for  the  q moment  tne  partial  sum 


= .1L  f1]  ^ 1a1p(1ti/b) 


q i=0  \i 


The  tirst  two  moments  about  the  mean  are 


m^  = c t aP(l  + l/b)  , 


anu 


L 


fV- 'vrwff 


'f-1'  J !■■  M','1  — - 


2 2 

a»2  = a [P(1  + 2/b)  - P (1  + </o)  ] . 


Obviously,  it  is  inpossible  to  solve 
explicitly  for  the  parameters,  although 
solution  is  possible  by  elimination  of 
Surprisingly,  however,  the  skewness 


these  equations 
an  iterative 
one  parameter. 


m /m 
3 2 


P(  1+  3/b)  - 3P(  1 + 2/b)P(  1 + 1/b)  + 2P  (1+1/b) 

2 3/2 

[P(U2/b)  - p (1  + 1/b)  ] 


and  kurtosis 
2 

m /m  = 

4 2 

2 4 

P(1+4/b)  - 4p(1+3/b)P(1  + 1/b)  ♦ 6P(1  + 2/b)P  (1  + 1/b)  - 3 P (1  + 1/b) 

cr ( 1+2/b)  - p2  (i+i/b)  j2 


are  strictly  functions  of  the  shape  parameter,  b.  Each  can 
individually  yield  estimates  of  b by  reference  to  a simple 
tabulation  [198],  depending  upon  the  judgement  of  the  data 
analyst. 


Given  b,  one  may  sequentially  obtain  moment  estimators 
of  the  scale  parameter 


• • 2 • 1/2 
a = (m2/[P(1  + 2/b)  - p (1  + 1/b)  ]) 


and  location  parameter 
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c = x - a p(1  + 1/b)  . 

Unfortunately,  these  moment  estimators  qo  not  have  many 
desirable  properties.  Dubev[74]  nas  investigated  the 
efficiency  cl  moment  estimators  for  Weitull  parameters  in 
restricted  cases.  The  most  common  complaint  is  that  the 

• 

estimate  of  location  parameter,  c,  regularly  violates  tne 

simple  numerical  constraints  of  the  definition  of  the 
Weibull  random  variable.  Most  often  this  would  be 

0 < c < x 

[1] 

When  a violation  occurs,  it  is  not  clear  that  any  reasonable 


met  nod 

exists 

for 

adjusting  c and  "backing 

out"  the  change 

through 

the 

other 

moment 

eguations.  The 

most  common 

practice 

in 

t hes  e 

cases 

is  to  replace  c 

by  the  violated 

bound,  and  conveniently  disregard  the  effect  upon  the 
moments  of  the  solution. 


C.  MATHEMATICAL  PRELIMINARIES 


M.L.E.  for  the  parametric  Weibull  family  may  ostensibly 
be  numerically  performed  with  almost  any  of  the 
unconstrained  techniques  introduced  in  Chapter  II. 
Therefore,  in  order  to  compare  the  merits  of  various 
approaches,  the  following  development  gives  the  necessary 
mathematical  basis  for  consideration  of  any  of  the  search  or 
ascent  methods. 


The  log  likelihood  function  for  the  Weibull  family 


f (x,  T)  is 

2 


L (X,T)  = ln[  (b/a)”.'fr  [ (x.  - c)/a]b  1 exp{-[  (x.  - c)/a]b}]  ; 

1=1  1 l 


a,  b > 0,  c > x , 

CU 


with  x the  first  order  statistic.  This  gives 

P] 


L (X ,T)  = n ln[  b/a  ] + (b  - 1)  ^ ln[  (x  - c)  /a  ] 

i = 1 i 

n b 

- X ,[  (X.  - c)/a  ] . 

i=  1 i 


The  gradient  of  the  log  livelihood,  V L(XrT),  has  the  three 

T 

elements 


V = (n/a)  {£  [ (x.  - c)/a]b  - n)  , 
a i=  1 l 


l 

j 

i 

{ 

-! 

i 


1 


•< 

i 

i 
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7 = n/b  + £ lr[  (x  - c)  / a]  - .£.[(».  - c) /a  ] ln[  (x  - c)/a] 

b i=1  i i=1  i 1 

r n b“  1 

7=-(b-  1)  £ [ 1/(x.  - c)  ] ♦ (b/a)  Ll.  [ (x.  - c)/a] 

n 1=1  1 1-11 


The  symmetric  Hessian  matrix  for  tne  family,  f (x,T) 
parameterized  by 

T*  = { a,  b,  c } , 

is  defined  as 


H (T)  = (h  } = b L(X,T)/dt.dt  . , 
i j 1 3 


and  is  given  by 


h = (b/a  ) {n  - (b+1)  .I1C  (x . " c)  /a  1 ) • 
1 1 1=1  i 


h = (1/a)  J-n  + [ (x.  - c)/a] 

12  i^1  i 


n b 

+ b it  [ (X  - c)  /a  ] ln[  (x  . - c)  /a  ]}  , 
i=  1 i 1 


2 n b-1 

h = - (t/a)  r [ (x.  - c)  /a  ] 

13  i=1  i 


2 n 


b 2 


22 


= -n/b  • f[(x  - c)/a]  In  [ (x . - c)  / a]  , 
i=1  i 1 
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1 


23 


- c) 

1 = 1 1 


(1/a)  (b.f:  [ (x.  - c)/a]°  ^nffx.  - c)  / a] 

i=1  l i 


b-1 


♦ .O  (*  . “ c)/a  ] } 

i = l l 


h = - (fc  - 1)  {.£  (x , - c)  2 + (b/a^)  i f (x  - c)/a] 
33  i=1  l 1=1  l 


b-2 


We  see  immediately  that  the  three  parameter  Wcibull 
family,  f (x,T)  , is  not  of  the  Koopman-Pitman  form  admitting 

sutficient  statistics.  Therefore,  search  for  an  M.V.U.E.  is 
pointless. 

Special  cases  of  the  Weibull  family  have  already  been 
mentioned  for  known  b=1  (exponential)  and  b=2  (Rayleigh)  . 
The  former  case  is  covered  exhaustively  in  the  literature. 
M.L.  estimation  for  the  two  parameter  exponential  is 
ncnregular,  requiring  use  of 

A 

c = x 

m 


We  shall  hence  iorth  rule  out  values  of  b<  1 , since  the 
likelihood  function  is  unbounded  as  the  location  parameter, 
c,  approaches  x and  thus  the  Weibull  density  is 

in 

inappropriate  for  use  in  the  M.L.  estimation. 

Rockettef 201  ] analyzes  the  other  cases  with  b>1,  for 
various  subsets  of  the  parameters  known.  If  both  a and  b, 
or  a and  c,  are  known,  solution  of  the  appropriate  remaining 
gradient  element  gives  a unique  M.L.  estimator,  as  is 
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verified  by  examination  of  the  numerical  behavior  of  the 
applicable  conditioned  Hessian  term. 


If  b and  c are  known,  a Jacobian  transformation 


v = (x  - c)  , 


gives  an  exponential  density  for  v with  well  known 

_ A 

properties  lrcluding  uniqueness  for  a. 


If  only  b is  known,  the  resulting  solution  for  the 
M.L.E.  is  unique.  McCooi[157]  shows  that  if  only  c is 

known,  the  remaining  fi.L.E.  are  unique.  We  shall  see  later 

A 

that  knowledge  of  a is  of  little  value,  since  a can  be 

A A 

derived  as  a function  of  b and  c. 


Proceeding  with  the  general  three  parameter  case,  it 
will  be  reassuring  for  purposes  of  validation  to  show  that 
the  expectation  of  the  gradient  satisfies 


E[  V L(X,T)  ] = 0 . 

T 


This  derivation,  and  others  which  follow,  require  definition 
of  several  mathematical  functions  and  identities.  For 
scalar  z>0,  the  gamma  function  is  defined  as: 


T z-1  -y 

(z)  = / y e dy  , 


witn  tne  useful  relation 


p< z+i)  = zf(z)  , 


and  the  derivatives  P'(z)  and  p"  (z)  , are  defined  by: 


(i)  (i)  z-1  -y 

P (z)  = / In  (y)  y e dy  , i = 1,2,.. 


Also,  there  are  tabulations  given  by  Gauss[96],  H.  Davis[o7] 
and  P.  Davis[69]  of  the  digamma  (Psi)  function 


f(z)  = V lnP(z)  = p*  (z)/P(z) 

7. 


which  has  the  recursive  property 

Y<z+1)  = f(Z)  + 1/z 

and  the  trigamma  function,  with  tabulations  presented  by  H. 
Davis[66]  and  P.  Davis[  69  ],  defined 


Y'(z:  = 7 [ VlnP(z)  ] P»(z)/r(z)  - cr*  (z)/p(z)  ] 

z z 


= P"<z)/r(z)  - r (z)  . 


This  will  permit  algebraic  substitution  using 


r*  (z)  = p<z)Y(z) 


p" (z)  = rwiv (z)  ♦ f (z)  ] . 


Now,  the  expectations  of  the  gradient,  V L(X»T),  are 

T 


E[  V ] = (nb/a)P(2)  - nb/a  = 0 , 
a 
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= n/b  ♦ (n/b)P*  (1)  - (n/b)P*  (2) 

= (n/b)  [ 1 ♦ P(1)f(1)  - p(2)^( 2)  ] 

= (n/b)  f 1 ♦ VO)  ~ f(2)  ] 

= (n/b)  [ 1 ♦ fO)  - Wl)  -13-0, 


E[7  j = -n[  (b-1)/a  ]P(1-1/b)  ♦ n[b/a]P(2-1/b) 
c 

= -n[  (b-l)/a  ]r(1-1/b)  ♦ n[  b/a  ] (1-  1/b)p(1-1/b) 

= 0 . 


The  symmetric  information  matrix. 


fl  = E[-H(T)  ] = {E[-h.  .])  , 

ID 


can  be  derived  from  the  terms  of  the  symmetric  Hessian  and 
the  Weibull  density,  i^(x,T),  as  follows: 


2 2 
EC-h  ] = (nt/a  )[  (b*l)P(2)-1  ] = n (D/a) 


E[-h  ] = (n/a)[  1-r(2)-P*  (2)  ] = - (n/djf  (2) 
= -(n/a)<^(2)  = - (n/a)  (0.42278***) 


= -(n/a)P(2)f(2) 


E[-h  ] = (b/a)  P(2-1/b)  , 


E[-h^]  = n/b  + P"(  2)  = n/b  + tf'  (2)  ♦ r2  (2)  3P(2) 


= n/c  + 0 . 466 1 9» • • 


74 


!PP!IP!PJP!S^  -HU'i  w 1 


E[-»»  ] - (n/a)[P(  1- 1/b)  ~P(2- 1/b) -bP1  (2-1/b)  ] 

= (n/a)  [f1(  1- 1/b)  -p(2-1/b)  [1  - bf(2-1/b)}]  , 

El -h  3 3 1 - n[  (b  - 1)/a2][p(1-2/D)  + bP(2-2/b)  ] 

= n[  (b-1)/a]2P(1-2/b)  . 

Ravenis[  iy8]  gives  the  information  matrix  for  the  Weibull 
family,  f (x,T),  and  Harter  and  floore[118]  give  similar 

numerical  results  for  singly  censored  Weibull  samples. 


We  recall  that  the  inverse  of  the  information  matrix  is 
the  Crainer-Rao  minimum  variance  bound  discussed  earlier. 
This  inverse  can  be  derived  algebraically,  but  the 
usefuliness  of  this  explicit  result  does  not  warrant  the 
space  and  effort  required  for  derivation  and  display  here. 
Although  Huzurba zar[ 1 32 ] has  shown  that  for  any 

(multivariate)  density  of  the  Koopman-Pitman  family  -that 
is,  any  density  admitting  sufficient  statistics-  the  M.L.E. 

asymptotically  acnieve  the  bound,  so  that  the  inverse  cf  the 

A 

information  matrix  is  the  variance  covariance  matrix  fcr  I, 
the  full  parametric  Weibull  family  is  not  a Koopman-Pitman 
form.  Fortunately,  Halper in[  1 1d  ] generalizes  the  Crainer-Rao 
minimum  variance  bound  result  under  mird  regularity 

conditions  to  any  density  and  also  establishes  asymptotic 
unbiasedness,  consistency  and  normality  for  M.L.E. 

For  the  Weibull  family  of  densities,  rl.L.  estimation  is 

reg  ularf  1 18  ] for  complete  samples  only  if  the  location 

parameter,  c,  is  known,  or  if  the  shape  parameter,  t,  is 

qreater  than  2.  We  can  verify  above,  in  fact,  that  the  term 

E[-h  ] in  the  information  matrix  has  a singularity  for  b=2, 
33 
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bat  that  dll  terms  are  well  defined  for  d>2.  Huber[1J1]  and 
Le  Cam[l49]  investigate  the  effect  of  nonieqalarity  on 
M.L.E.  properties. 

Since  for  any  regular  parametric  estimation  the  M . L . K . 
asy mptot ically  achieve  the  Cramer-Pao  bound  given  by  the 

A - ; 

inverse  of  the  information  matrix,  E[-H(T)]  , and  this 

inverse,  a variance  covariance  matrix,  is  known  to  be 

positive  definitef  141, p. 56],  we  may  take  some  comfort  in 

A 

the  knowledge  that  at  least  in  expectation  B (T)  is  negative 
definite.  A stronger  result  follows  immediately  for  simply 
consistent  M.L.E.,  which  must  converge  in  prooaDility  to  a 
single  set  of  parameters.  Thus,  even  for  a problem  with 
multiple  solutions,  the  coordinates  of  the  maxima  must 
approach  the  same  point  asymptot  ical  ly . Chardra[46] 
presents  evidence  for  consistency  of  M.L.E.  under  very 
general  conditions.  We  also  recall  that  M.L.E.  are 
asymptotically  functions  of  sufficient  statistics,  when  such 

exist.  Huzurbazar[ 1 32 ] shows  that  sufficiency  a2so  imriies 

A 

uniqueness  for  T. 


Choi[48]  gives  estimates  of  bias  for  M.L.E.  in  the 
closely  related  Gamma  family  of  densities,  finding  the 
magnitude  of  bias  to  be  small  even  for  intermediate  sample 
sizes.  Harter  and  Moore[117]  suggest  that  a local  maxima, 
though  not  a true  M.L.E.  in  the  strictest  global  sense,  can 
exhibit  most  of  the  desirable  statistical  properties. 


For  finite  sample  sizes  in  cases  such  as  the  Weibull 
where  no  sufficiency  can  be  established,  the  asymptotic 
results  above  do  not  necessarily  hold.  However,  Harter  and 
Moore[119]  have  performed  extensive  simulation  studies  for 
the  three  parameter  Weibull  family,  and  give  tables  showing 
that  the  Weibull  M.L.E.  achieve  their  asymptotic  variances 
very  quickly,  with  the  actual  variance  exceeding  the 

7o 
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Cramer-Rao  bound  by  an  amount,  proportional  to  1/n  . Thus, 

their  results  suggest  that  the  inverse  of  the  information 
matrix  is  valid  as  an  estimate  of  the  M.L.E.  variance 
covariance  matrix  for  samples  as  small  m size  as  100.  When 
a,  or  b,  are  known,  this  result  is  achieved  much  more 
rapidly . 

As  mentioned  earlier,  when  the  shape  parameter,  b,  is 
less  than  2,  a nonregular  estimation  results,  and  if  b<1, 
L(X,T)  is  not  oounded.  Pike[185]  suggests  that  when  the* 
likelihood  function  is  unbounded  in  the  feasible  parameter 
space,  a local  maxima  provides  a reasonable  estimate.  It  is 
easiest  tc  constrain  the  range  of  the  shape  parameter  to 
avoid  this  problem,  since  in  a pure  £>ense  the  M.L.  method  is 
inapplicable  otherwise.  Rockette,  Antle  and  Klimkc[202] 
suggest  substitution  of 


+ + ♦ ♦ 

T = { a , b , c } , 


with 


>+  = .£  (x.  - 

1=1  l 


c)  /n  , 


and 


b = 1 , 


c = x 


[1] 


for  cases  in  which  no  M.L.  solution  exists.  For  cases  in 


which  a maximum  is  achieved,  it  is  compared  with  T , and  the 
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A 

M.L.  solution  T is  chosen  to  correspond  with  the  higher 
likelihood. 


As  a useful  simplification  of  the  three  parameter 

Weioull  estimation,  the  scale  parameter  can  be  eliminated 

from  the  system  by  solution  of  V =0  and  substitution,  which 

a 

gives 


A 

a 


b 1/b 
c)  /n  ] 


so  that 


* A 

T = { a,  b,  c } , 


for  which  the  conditioned  log  likelihood  function  becomes 


* n b 

L (1,1)  = n In  (b)  - n ln[.2^(x.  ~ c)  /n] 


+ (b-1)  ^ ln(x  - c)  - n , 

i = 1 i 


with  gradient 


V n/b  - n[  (x  - c)  ln(x  - c)  / ^ (x  - c)  ] 

b i = 1 i i i=  1 i 


f it  1 .1  ( A - C)  , 

1 I 1 


* i,  b-1  n b n 

V n b 2_  (X  . - c)  / EL  (X  . - c)  - (b-1)  £ [ 1/(x  - c)  ] . 

c 1=1  1 1=1  1 l = 1 i 


The  symmetric  Hessian  , § , for  this  reduced  problem  is 
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h = - n/b 
22 


n b i 

- n r (x  - c)  In  (x 
i=  1 i i 

n b 

♦ n[  f (x  - c)  In  (x 
i = 1 i l 


C)  / i (X . 

i=  1 i 


- c) 


- c)  / 


.£.(*.  - 

1 = 1 1 


b 2 

c)  3 , 


b- 1 


b-1 


23 


n{b  (x  - c)  ln(x  - c)  + JI  (x  ~ c)  ) 
i= 1 i i i=1  i 


/jL  (x.  * c) 

i=1  l 


fb  n b-1 

(x  - c)  In  (x  - c)  2_  (X  - c)  } 
i=1  i l i=1  i 


b 2 


/[  X (x . - C)  ] 

1 = 1 l 


.£.(x.  - 

1=1  i 


C) 


-1 
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2 n b-1  2 

nb  [ <xi  ‘ °)  J 

b n 


n bn  b-  2 n 

- 1L  (X  - C)  X (X  . - C)  ] /[  £ (X . 

1=1  1 1=1  1 1=1  1 


b 2 

c)  3 


n b-2  n 

+ nb  21  (x  - c)  / .21  (x . - 

i=1  l i=1  l 

n -2 

* (fc-1)  [x  - c]  . 

i = 1 l 


C) 


* 

The  expectation  of  H is  hopelessly  difficult  to  derive,  so 

that  additional  assertions  of  uniqueness,  or  other 
properties,  are  not  achievable  by  that  method  for  this 
reduced  system.  However,  Peto  and  Lee[184]  treat  each 
element  cf  the  gradient  as  an  implicit  function  of  the  other 
respective  M.L.  estimator,  and  plot  the  two  trajectories  for 
* 

7 L (X,T) =0,  showing  the  M.L.  solution  as  an  intersection. 

'rt 

X 
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hockettef 20 1 ] carries  this  treatment  further,  and  obtains 

strong,  tut  net  conclusive  evidence  via  lengthy  term  by  term 
inequality  arguments  for  tne  sums  of  powers  of  X found  in 
the  conditioned  Hessian  (drawing  from  Hardy,  Littlewood  and 
Polya[1l4])  that  at  most  one  maximum  exists  fir  the  three 
parameter  Weioull  M.L.  problem,  and  that  tnere  is  a saddle 
point  associated  with  each  maximum.  Extensive  empirical 
evidence  supports  these  assertions. 

A generalization  of  the  Weibull  likelihood  model  is 
possible  fer  other  than  complete  samples.  Cohen[ 52  ] 

classifies  the  process  by  which  elements  have  been  censored 
from  a sample  of  failure  times  in  life  testing  as  being  of 
Type  I when  the  sampling  process  is  stopped  at  some 
predetermined  time,  or  Type  II  when  testing  ceases  after 
some  fixed  number  of  elements,  k,  have  failed.  For  Type  I 
censoring,  the  number  of  observed  failure  times,  k,  is  a 
random  variable,  while  Type  II  censoring  provides  a random 
sampling  cessation  time.  For  either  type  of  censoring,  when 
only  the  first  k out  of  n elements  have  been  observed,  the 
Weibull  likelihood  function  is 

K n * K 

L (X#T)  = ln{[n!/(n-k)  ! ] I[  f (x . ,T)  ] R (x  ,T)  ‘ } 

k i=  1 2 l 2 k 

= ln[nJ/(n-k)!]  + k[  in  (b)  - b In  (a)] 

+ (b-1)  i ln(x.  - c)  - f [ (x.  - c)/a] 
i=1  l i=1  i 

b 

- (n-k)  [ (x  - c)  /a  ] 

k 

with  x fixed  for  Type  I censoring,  and  x =x  for  Type  II. 
k k [ k ] 

linger  and  Spr inkle[  200  ] and  Cohen[  53  ] discuss  M.L.E. 
for  the  censored  W\.ibull  density  with  c=G,  and  Harter  and 


HO 


V 


Hoore[11o]  study  the  three  parameter  case  for  the  Weibull 
and  Gamma  families. 

Progressive  censoring  of  either  type  occurs  when  at 

successive  stages  a number  of  survivors  are  randomr y 

selected  for  removal  from  further  testing.  Suppose  that  s 

stages  of  censoring  occur  at  progressive  times  y , 

i 

j= 1 ,2, . . . „s,  with  k^  functioning  elements  randomly  removed 

at  each  stage  from  further  testing.  Progressive  Type  I 

censoring,  with  the  predetermined  constant  times,  Y, 
produces  a random  sample  size 


- = " • £ 


N * 


and  has  log  likelihood 


L1  (X,T)  = ln(  C 'frf  (x.,T)  (y  ,T)  * (j)  } 

Y,K  1=1  2 l j=1  2 j 

= In  (C)  ♦ m ln(b/a)  ♦ (b-1)  ,^ln[  (x  - c)/a] 

- £ J (x  . ~ c)  /a  ]b  ♦ jt  k .[  (y  - c)  /a  ] 
i=1  l j=1  J 3 


with  C a combinatorial  constant. 


For  Type  II  progressively  censored  sampling,  with  the 

number  of  censored  elements,  k , fixed,  and  the  times  of 

3 

censorship  occurring  randomly  with  each  failure,  the  log 


likelihood  is 


■ 


L"k<X'T» 


s i-1  K(l) 

1 n £ It  C ( n - Z[k-i*1])  f (Y.»T)  ft  (y.,T)  ]} 

1=1  3=1  3 2 l 2 i 

fln{[  (n  - Z\k  -i*1]> 

1=1  3=1  ] 


♦ (b-1)  .Zln[  (y  . “ o)  /d  ] 

1=1  l 

~ fe,t  (y  . - c)  /a  ]U  + n.  [ (y  . 

1=  1 1 1^1  1 1 


b 

c)  /a  ] 


The  gradient  and  Hessian  matrix  for  eacn  of  these 
models  is  not  included  here.  It  should  be  pointed  out, 
however,  that  the  scale  parameter  can  be  eliminated  in  each 

* 

model  in  precisely  the  same  manner  that  gave  L (X,T) . 

II 

ttinger  and  Sprin  kle[  200  ] give  the  gradient  for  L (X,T) 

Y , K 

when  c = 0,  and  Cohen[53]  gives  the  gradient  and  Hessian  for 


f^(X,T)  with  c=0.  Wingo(234  ] gives  the  gradient  and  Hessian 
for  Type  I progressive  censoring  of  f (X,T) . 


Harter  and  Moore[118]  give  the  gradient  and  Hessian  for 
doubly  censored,  or  truncated,  samples  m which  the  first  k 
and  last  r elements  have  not  been  observed.  Tne  form  of  the 
log  likelihood  function  is  very  similar  to  the  singly 
censored  case.  An  interesting  result  of  these  empirical 
studies  cf  censored  sampling  is  that  when  the  location 
parameter,  c,  is  not  bounded  from  the  left  a higner  variance 
results  for  c than  when  c is  constrained 

0 < c < x 

[1] 
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Rockette[ z0 1 ] reports  chat  this  effect  seems  to  increase 
with  the  ratio,  c/a.  Polfeldt[  18d,  187  ] has  given  some 
limited  theoretical  results  for  nonregular  estimation  of 
location  parameters.  Antle  and  Bain[ : j have  given  several 
interesting  transformations  of  scale  and  location  parameters 
which  ate  statistically  independent. 

Note  that  a numerical  singularity  occurs  when  c 

approaches  x too  closely  in  a complete  sample.  This 

[1] 

A 

suggests  using  c=x  and  dropuinq  x from  the  sample  when 

m m 

necessary.  For  large  samples  this  heuristic  can  be  extended 
to  the  censoring  of  all  sample  elements  in  the  close 


neighborhood  of  x 


As  a practical  matter,  this 


adjustment  can  circumven.  serious  difficulties  with  M.L. 

estimaticr  for  some  samples,  but  it  is  somewhat  distasteful 
to  peremptorily  discard  costly  sampling  information  in  this 
way.  Also,  strongly  assymetric  censoring  can  introduce  more 
bias  for  the  M.L.E.,  and  of  course  increase  M.S.E.  Harter 
has  reported  that  even  for  sample  sizes  of  10  and  20,  bias 
is  net.  severe  undeL  moderate  censoring  and  the  theoretical 
variance  cf  the  M.L.E.  is  not  greatly  exceeded. 


The  reluctance  with  which  T approaches  the  Ciamer-Rao 
bound  for  intermediate  sample  sizes  can  be  overcome  by 
constraining  the  shape  parameter,  b,  to  a feasible  range 
known  uy  the  investigator.  The  M.S.E.  can  be  lowered 
signii icant ly  by  such  precaution,  since  the  Weibull  density 
function  and  consequently  the  likelihood  objective  function 
are  very  unruly  for  hiqh  values  of  b,  unbounded  for  b=1,  and 
nonregular  for  n<2.  If  we  constrain  b to  values  between, 
for  instance,  1 and  4 we  have  still  included  a very  robust 
parametric  family  in  our  investigation,  but  one  with  less 
habitual  inclination  to  provide  ridiculous  likelihood 
estimates  for  purely  numerical  reasons. 


For  very  small  samples,  an  investigator  is  faced  with 
the  unfortunate  paradox  that,  although  the  objective 
function  and  its  derivatives  are  easily  and  quickly 
evaluated,  the  likelihood  surface  can  exhibit  a tortuous 

landscape.  Perhaps  this  is  fortuitous,  tor  otherwise  one 

A 

would  be  tempted  to  rely  on  T despite  its  unknown 
statistical  properties.  The  irregularity  introduced  ny 
including  the  location  parameter,  c,  in  the  search  is  most 
troublesome  for  these  cases.  The  frequent  occurrence  of  a 
stationary  saddle  point  usually  takes  place  at  parametric- 

coordinates  relatively  close  to  the  upper  bound  for  c,  x ; 

[ 1 ] 

however  the  saddle  point  can  lie  well  within  the  ranqe  of  c 
for  small  samples,  making  it  difficult  to  consistently 
identify  and  avoid  numerically. 

If  a sample  is  used  for  the  .I.L.  estimation  with  the 
Weibull  model  that  actually  comes  from  some  markedly 
different  population,  the  results  can  be  disastrous  even  for 
large  sample  sizes.  It  is  worthwhile  to  remember  that  the 
statistical  theory  underlying  this  estimation  process 
req  uires  that  the  hypothetical  assumption  of  the  density 
function  for  which  point  estimates  are  sought  must  be  based 
in  fact.  Two  singular  examples  of  such  (large)  samples  have 
come  to  the  author’s  belated  attention  in  this  regard;  one 
was  subsequently  identified  as  coming  from  a paLeto 
population,  and  the  other  was  ultimately  determined  to  be  a 
sample  from  a beta  density.  These  samples  wreaked  numerical 
havoc  with  several  optimization  codes  applied  to  the  Weibull 
model,  the  first  because  of  too  many  sample  elements  in  the 
extreme  right  tail  for  the  Weibull  density  to  fit,  and  the 
second  due  to  the  effect  of  a finite  upper  domain  limit  for 
a symmetric  sample.  Both  samples  produced  apparent 

A 

stationary  saddle  points,  numerically  unbounded  T,  and 
infinite  likelihoods  at  various  times.  Thus,  great  care 
must  be  taken  in  applying  any  numerical  K.L.  procedure  to 
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the  Weibull,  since  termini  t.  a stationary  point  on 
L(X,T)  should  be  allowed  only  for  a maximum,  or  the 
optimization  problem  should  terminate  with  indication  of  no 
achievable  finite  optimum. 

Inferential  techniques  based  on  finite  samples  tor 
Weibull  and  other  closely  related  models  are  given  by  Harter 
and  ftoore[115],  Bain  and  Weeks[9],  Thoman,  Bain  and 
Antle[221],  Bain[  8 ] and  Billman,  Antle  and  Bain[23],  Most 
of  these  investigations  give  tables  which  are  developed  by 
estensive  simulation. 

M.L.  estimation  of  the  reliability  function  is  shewn  to 
be  surprisingly  unbiased  and  robust  by  Hager,  Bain  and 
Antle[112]  and  others. 
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C.  NUMERICAL  APPROACH 


The  methods  commonly  used  lor  numor ically  determining 
Weibull  M.L.E.  have  historically  included  cyclic  search  by 
Harter  and  Mcore[116],  second  order  Newton-Raphson  ascent  by 
Peto  and  Lee(174]  and  Ringer  and  S prinkie[  200  ] , and 
Quasilinearization  (New ton-Raphson  ascent  with  the  Hessian 
approximated  by  differences)  by  Wingof 233, 234 ]. 

The  magnitude  of  effort  involved  in  applying  a search 
technique  tc  M.L.  estimation  is  implied  by  the  iteration 
limit  suggested  by  Harter  and  Moore  of  550  cycles  for  tne 
Weibull  model  and  1100  cycles  for  a Gamma.  Barnet  te[  1'*  1 
suggests  cyclic  search  for  likelihood  models  with  multiple 
roots;  such  cases  usually  occur  only  for  small  samples  from 
selected  density  functions.  The  Weibull  model  has  never 
empirically  exhibited  finite  multiple  optima,  althougn  some 


small  sample  estimations  lead  to  numerical  difficulties 
characterized  by  a stalling  of  the  iteration  over  a 

respectable  neighborhood. 

In  a refreshingly  honestly  titled  article.  Mantel  and 

Myers[162]  report  that  for  second  order  ascent  methods 

choice  of  the  starting  value,  T , is  vital  to  success. 

0 

(This  is,  of  course,  no  theoretical  surprise  to  a numerical 

analyst.)  As  we  have  seen,  the  Weibull  model  seems  to 

produce  a saddle  point  as  a gratuitous  companion  of  a 
maximum.  For  this  reason,  pure  second  oruer  representations 
of  the  problem  have  consistently  not  lean  to  acceptable 
performance  of  the  optimization  algorithm. 


Kaie[ 1 37, 138 ] compares  the  second  order  Newton-Raphson 
and  Fisher's  scoring  methods  and  indicates  that  they  are 
most  applicable  for  large  samples.  Onf or t ana  tel y , both  his 


iteration  and  residue  criteria  tor  rate  of  convergence 
evaluation  are  not  directly  related  to  computer  time,  and 
his  sample  problem  is  a fairly  uncomplicated  two  parameter 
estimation.  Michelinif 167  ] gives  a method  tor  selecting 
starting  values  for  the  scoring  method  applied  to  a 
lognormal  model,  and  presents  fascinating  graphical 
depiction  of  empirical  regions  of  convergence. 

Implementation  of  both  first  and  second  order  ascent 
methods  and  search  techniques  for  the  Weibull  models  have 
produced  tne  following  conclusions.  The  first  order 
gradient  methods  are  superior  to  search  techniques  for 
reasons  of  speed,  and  are  better  than  second  order  iteration 
on  the  basis  of  reliable  convergence.  The  saddle  point 
dilemma  is  most  expeditiosly  resolved  by  use  of  first  order 
methods  ana  a solution  verification;  convergence  to  a saddle 
point  is  very  rare  for  the  constrained  parameter  problem  and 
sample  sizes  greater  than  10. 

All  techniques  regularly  fail  for  small  samples.  It  is 
suggested  that  for  these  cases  either  the  sample  has 
insufficient  information  to  warrant  M.L.E.,  or  the  wrong 
density  is  being  used  for  parametric  estimation. 

A hybrid  ascent  method  which  produces  both  fast  and 
dependable  convergent  *>  for  highly  nonlinear  problems 
utilizes  both  first  and  second  order  representations  of  the 
maximization  problem.  The  first  order  formulation  is  used 
to  begin  the  solution,  and  continued  until  the  amount  of 
information  in  the  linear  term  of  the  objective  function 
approximation  diminishes  significantly  below  the  remaining 
higher  order  terms  in 

L (T  ♦ AT)  = L(T  ) ♦ 7L(T  ) 'AT  + rem(T  ,AT) 

0 0 0 0 
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The  switching  rule  requires  that  a sequence  of  solutions  oi 
specified  length  exhibit  a moving  average  of 

avg  { L (T^)  / remfT^AT)}  S B^  . 

B has  been  set  at  0.1.  The  transition  to  the  second  order 

4 

representation  is  not  guaranteed  for  every  problem,  since  it 

occurs  only  when  higher  order  terms  dominate  the  local 
approximation  of  L. 

To  test  these  methods,  50  samples  of  size  100  were 

randomly  generated  from  Weibull  families  with  (arbitrarily) 

a = 50,  c=100,  and  b = 1 . 5, 2 . 0 , 2 . 5 , 3 . 0 . For  all  estimations  a 

total  computation  time  was  recorded  with  the  iteration 

records.  The  host  computer  was  an  IBM  360/67  - II  with 

optimized  FOKTRAN- IV (H)  . A pure  first  order  representation, 

and  the  hybrid  scheme  were  used  on  the  matched  set  of  random 

samples-,  with  B =0.1  and  the  length  of  the  moving  average 

4 

for  the  switching  rule  set  at  two. 


The  results  were 
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AVERAGE  PERFORMANCE 


b 

METHOD 

a 

b 

c 

Time  (sec) 

ITERATIONS 

1.5 

I 

47.95 

1.56 

101.28 

46.0 

59. 3 

4.47 

.19 

1.67 

II 

46.01 

1.52 

101.27 

18.2 

21.3(12.7) 

4.94 

.17 

1 . 55 

2.0 

I 

48.56 

1.98 

101 . 19 

50.9 

59.2 

5.00 

.30 

3.28 

II 

48.54 

1.97 

101. 19 

20.9 

19.6  (4.8) 

4.98 

.30 

3.29 

2.5 

I 

49.23 

2.49 

100.83 

45.7 

51.8 

5.26 

.37 

4.92 

II 

49.25 

2.49 

100.80 

19.  1 

17.0  (3.9) 

5.22 

.36 

4 .90 

3.0 

I 

47.80 

2.97 

101.75 

51.4 

67.6 

6.69 

. 56 

6.21 

II 

47.76 

2.99 

101.72 

16.6 

15.9(4.1) 

6.64 

.55 

6.18 

The  root  mean  squared  error  is  given  just  below  the  sample 
■ean  for  estimators  of  each  parameter.  The  number  in 
parentheses  following  the  average  iterations  reguired  for 
convergence  of  the  second  order  scheme  indicates  the  average 
number  of  first  order  iterations  required  to  trigger  the 
switching  rule.  There  were  no  cases  for  which  convergence 
was  not  achieved.  The  second  order  representation  clearly 
dominates  these  results. 

For  b=1.5,  seven  cases  converged  with  results  which 

♦ 

were  replaced  by  the  solution,  T , (with  b=1)  on  the  basis 

of  likelihood  comparison.  Samples  generated  for  other 
higher  values  of  b produced  no  such  replacements.  For 
samples  from  the  population  with  b=3.0,  six  cases  converged 
to  the  upper  numerical  bound,  b=4,  ana  did  not  achieve 
transition  to  the  second  order  representation.  Since  this 
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seemed  to  have  no  serious  effect  upon  the  expect  at? on  of  b 
for  this  set  of  samples,  no  further  action  was  taken, 
haisinq  the  bound  for  b might  be  indicated  in  other 
investiqatioi:  contexts. 

All  estimations  were  started  with  the  initial  solution 


b-2  and  c-0.9  x 


[1] 


In  order  to  test  the  robustness  of  tne 


numerical  methods  with  respect  to  starting  valuss,  the  same 

sets  of  random  samples  were  used  with  initial  values  of 
b=1.5,2.  C,3. 0.  The  numerical  values  of  the  H.L.iS.  results 
for  each  sample  showed  almost  no  sensitivity  to  starting 
value,  and  the  average  number  ot  iterations  required  for 
convergence  and  the  computation  time  consumed  were  not 
significantly  different,  although  individual  samples  did 
occasionally  exhibit  large  variations.  For  several  samples, 
the  first  order  method  converged  to  values  differing  in  the 
hundredths  position.  Such  variation  was  never  evident  for 
the  second  order  hybrid  iteration. 


The  behavior  of  the  objective  function  during 
optimization  with  the  hybrid  second  order  method  was 
remarkably  consistent  for  all  samples.  At  the  initial 
iteration,  the  first  order  term  in  the  linear  approximation 
of  the  objective  function  dominated  the  remainder  term. 
After  several  iterations,  as  indicated  in  the  performance 
data,  transition  to  the  second  order  representation  of  the 
problem  took  place,  after  which  no  more  than  three 
iterations  produced  a solution  for  which  the  linear 
approximation  of  the  second  order  representation  objective 
function  left  a remainder  term  several  orders  of  magnitude 
smaller  than  the  linear  term.  The  final  likelihood  achieved 
by  the  second  order  scheme  was  higher  than  that  given  by  the 
pure  first  order  method  in  every  case,  but  the  difference 
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never  exceeded  a relative  magnitude  of  10 

By  "tuning”  the  iteration  schemes,  significant  further 
reductions  in  computing  time  are  possible.  This  is 
especially  true  if  the  resolution  of  the  termination 
criteria  is  relaxed.  This  is  usually  a reasonable  course  of 
action,  since  the  values  of  the  tt.L.E.  are  not  really 
required  to,  say,  five  decimal  places.  Such  a change  for 
the  second  order  model  with  b=2  produced  an  average 
convergence  time  of  9.7  seconds  in  11.3(3.4)  iterations. 


It 

is 

important  to 

note 

that  for  various  sample 

sizes. 

terminat 

ion 

has  always 

been 

achieved 

without  num 

erical 

processo 

r 

interrupts. 

That 

is,  the 

algorithm  detects 

terminal 

si 

ngularities. 

and 

indicates 

them  under 

full 

control 

of 

the  program 

, permitting  remedial  action 

to  be 

taken,  or  simply  allowing  analysis  of  the  complete  output. 
This  is  far  superior  to  the  spectacular  results  to  be 
expected  from  most  Newton-Raphson  based  programs. 

There  is  no  theoretical  reason  prohibiting  formulations 
of  even  higher  order  representations  of  nonlinear 
optimization  problems.  The  activation  for  such  an  approach 
would  be  a highly  nonlinear  problem  for  which  the  first 
order  approximation  of  the  imbedded  local  linear  program  can 
be  expected  to  give  a poor  representation  of  functions,  tven 
with  the  second  order  formulation.  The  algebraic  demands  of 
higher  order  representations  of  likelihood  models  and  the 
consequent  debugging  and  computational  expense  of  the 
associated  higher  order  problems  do  not  promise  much 
practical  value.  Portuna+ely,  the  likelihood  models 
investigated  thusfar  have  not  reached  such  nonlinear 
extremes  within  the  numerical  capabilities  of  a digital 
computer,  as  is  attested  by  the  dominated  remainder  terms 
for  the  second  order  representation.  However,  it  is 


possible  that  foe  other  types  of  highly  nonlinear  probler.s 
high  order  representations  will  prove  a fruitful  field  for 
further  research.  A sequential  transition  nechaniso  such  as 
that  proposed  uere  aay  also  provide  for  robust  convergence 
with  higher  erder  formulations  as  it  has  done  in  the  present 
investigation. 
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D.  A USiiilliEAiLX  CONSTRAINED  PROBLEM 


Consider  the  following  problem.  A random  sample  is 
collected  from  a Weibull  nopulation  whose  mean  is  known,  but 
whose  parameters  are  to  be  estimated  by  H.L.  Such 
situations  arise,  for  instance,  when  census  information  is 
used  in  conjunction  with  random  survey  data  for  demographic 
modelling. 

As  another  example,  suppose  a bank  wishes  to  use  a 
random  sample  to  estimate  the  parameters  of  a Weibull 
density  function  describing  the  size  of  an  individual 
depositor's  account.  Certainly  the  bank  will  know  exactly 
the  total  of  money  on  deposit  and  the  number  of  depositors. 
Thus  the  mean  of  the  density  is  known,  but  not  the 
parameters. 

The  H.L.  formulation  of  such  a constrained  problem 
becomes 

MAX  L (X,T) 

T 

s.t.  c ♦ aP(1*1/b)  = , 

0 < a < ••  , 

1 < b < 4 , 

0 < c < x 

[1] 

The  constraint  has  gradient  elenents 

g = r<1*Vb)  , 
a 
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g * - <a/b  )f«  (1  + 1/b) 

D 


-(a/b  )P(H1/b)f(U1/b)  , 


g = 1 , 
c 


and  the  Hessian  for  the  constraint,  H,  has  the  ncnzero 
teras: 


h = -l/bV'  (1+  1/b) 


= -1/b  f(U1/b)f (U1/b)  , 


h = (a/b3)r*  <1+ 1/b)  ♦ <a/bV«(1*1/b) 

= (a/b4)P(H1/b)[by(U1/b)*f'  (H1/b)V  (1t1/b)  ] 


As  before,  the  scale  paraaeter,  a,  may  be  substituted 
out,  leaving 


MAX  L (X,  T) 
T 


S.  t. 


IV  b 1/b 

c ♦ [ X^(x  - c)  /n]  P<1  ♦ 1/b)  = 


">  - 


1 < b < 4 , 
0 < c < x 


The  gradient  for  this  reduced  problem  is 
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* n,  b 1/b„ 

qb  * f £^(x  " c>  /n3  P(1*1/b) 

((1/b)[£  (x4  - c)bln(x.  - c)/  f (x.  - c)b 
i=1  i 1 i=1  i 

- < 1/to)  In  (,£  (x  - c)  b/n)  ] 
i = 1 i 

♦ f(1  + 1/b)  } , 

* — jl  b 1/b- 1 n b- 1 

g = -p(U1/b)[  £ (x  - c)  /n]  [ .£  ( * . - c)  /n  ] . 

c i=1  i 1=1  i 


The  Hessian  for  the  constraint  Kill  not  be  given  here. 

As  a test  of  this  model,  ten  samples  of  size  100  were 
randomly  generated  with  a = 5Q,  b=2 , and  c=100,  and  the 
constrained  aean,  ■ # was  set  at 


■ ^ = c ♦ ap(1  + 1/b)  = 100  ♦ 50P(  1 . 5)  = 144.31=»»  . 


The  three  variable  aodel  was  run  for  both  first  order 
and  hybrid  schemes,  with  the  switching  rule  qualified  to 
activate  for  feasible  solutions  only.  The  results  were 


AVERAGE  PERFORMANCE 

n b METHOD  a 

b 

C 

Time  (sec) 

ITERATIONS 

100  2.0  I 

48.40 

1.98 

101.21 

132.8 

135.4 

4.65 

.38 

3.95 

II 

49.11 

2.00 

101.27 

89.7 

87.2(6.4) 

4.69 

.39 

4.02 

One  sample  did 

not  make 

a transition 

to  the 

second  order 

representation 

before 

convergence, 

and  two 

of  the  samples 

reguired  200  iterations 

for 

termination. 
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The  choice  of  width  for  the  equation  band  representing 
the  equality  constraint  feasibility  region  and  the  manner  in 
which  this  band  is  closed  during  the  progress  of  the 
solution  is  aost  important  for  insuring  success.  Preaature 
closing,  or  choice  of  a band  too  narrow  can  cause  the 
methods  to  stall,  especially  for  the  second  order 
representation  which  has  polygamaa  terms  that  are 
exceedingly  difficult  to  compute  precisely.  On  the  other 
hand,  too  wide  a band,  or  too  much  delay  of  the  closing  can 
lead  to  excessive  iterations  involving  infeasible  solutions. 
The  choice  of  a bandwidth  of  0.1  was  made  in  these 
applications  with  good  success  and  the  band  was  closed  by  32 
successive  bisections  for  feasible  solutions.  Upon 
transition  to  the  second  order  representation  of  the 
problem,  it  was  determined  that  a reinitialization  of  the 
equation  band  had  a desirable  effect  on  convergence. 

The  superiority  of  the  second  order  representation  of 
the  constrained  model  would  be  enhansed  greatly  by  the  use 
of  efficient  and/or  accurate  polygamma  functions. 
Currently,  the  best  series  approximations  derived  produce 
only  six  decimal  place  precision,  and  their  computation 
reguires  almost  half  of  the  iteration  time  reported  above. 

Other  side  constraints  can  be  added  to  parametric  M.L. 
estimation.  For  instance  prior  knowledge  of  the  population 
variance,  or  other  moments,  can  be  used  in  the  estimation. 
The  numerical  details  follow  directly  from  the  example  given 
here. 

The  results  for  both  classical  unconstrained  and 
constrained  models  reported  here  for  the  parametric  Weibull 
family  apply  with  remarkably  little  modification  to  the 
general  gamma  family  of  densities  as  well.  Regularity 
conditions,  gradients,  Hessians,  numerical  approach  and  so 
forth  follow  the  Weibull  examples  very  closely. 
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CHAPTER  IV 
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Il!2fiopiyciiou  jo  a p^ruoulli  regression  model 


Consider  a structural  model  based  on  the  observed 


sample 


(*  / I)  , 


where  y ^ is  one  of  a set  of  n (statistically)  independent 
discrete-valued  observations  with  m associated  parameters 


X - { X , « « . , X l • 

3 J1  3» 


In  this  usage,  X is  often  called  the  set  of  (structurally) 
independent  variables,  and  Y is  referred  to  as  the 
associated  (structurally)  dependent  variable,  with  T a set 
of  model  parameters. 

As  a specific  example,  suppose  that  I is  a set  of  "1-0" 
observations  of  "success,  or  failure"  from  n Bernoulli 
trials.  He  nay  assert  that  y is  observed  with 

3 


f<y  I*  *p>  = f (y jip (X ^) ) = f(y  IP  ) , 


and  that  f is  a parametric  family  of  Bernoulli  densities 


. . y<3)  „ i-y(3) 

f(yJipJ)  = p.  (1-p.i 

J J J J 
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0 s < 1 ; y^  = 0,  1 . 


In  the  regression,  p is  some  statea  mathematical 
function  of  the  independent  variables,  X^,  with  parameters. 


P = P (X  # T)  # 


and  is  interpreted  as  the  prior  probability  of  success  for  a 
Bernoulli  trial  carried  out  under  a given  set  of  conditions. 


To  illustrate,  suppose  that  p is  the  probability  of 

destroying,  or  disabling,  a target  with  a volley  of  shots  in 

a naval  bombardment.  The  success  of  each  volley  can  be 
considered  as  an  observation,  y^,  from  a Bernoulli  density 

with  parameter  p_^.  Clearly,  the  probability  of  success  on 

each  attempt  is  a function  of  distance  to  target,  sea 

conditions,  weapons  employed,  visibility,  and  so  forth  - 
characteristics  which  constitute  X . 

j 


If  we  employ  the  theory  of  ballistics  to  determine  a 
functional  form  for  p(X  ),  and  if  a record  is  kept  by  tne 

j 

lire  director  of  each  volley,  then  we  have  just  t.ne 
observations  required  to  estimate  p^. 


Consider  another  example.  Let  be  the  probability  of 
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a saog  alert  during  a particular  day.  If  a record  of  wind 
velocity,  wind  direction,  temperature,  nitrous  oxide  level, 
cloud  cover,  particulate  content,  and  so  forth,  is  kept 
daily  constituting  X^,  with  the  observation  of  a snog 

alert  for  that  day,  then  estiaation  of  p^  nay  be  attempted 

froa  n independent  observations  of  polluted  and  unpolluted 

days,  with  some  function,  p(X^),  supplied  by  the  researcher. 

The  Bernoulli  parameter,  p^,  may  be  the  probability  of 

default  on  a loan  given  credit  information  X^  the 

probability  cf  winning  an  election  given  a platform  and 

legislation  record,  the  probability  of  survival  given 
information  about  disease  and  treatment,  ad  infinitum. 

It  should  be  stressed  that  discrete  Bernoulli 
observations  are  often  available  when  continuous 
quantitative  inforaation  is  not,  or  when  continuous  measure 
is  inappropriate.  For  instance,  it  may  be  possible  to 
classify  an  individual  as  "poor"  while  to  use  a measure  of 
his  economic  income  would  be  difficult  or  impossible  due  to 
unreported  income,  government  subsidy  in  the  form  of  money, 
goods  and  services,  unclear  faaily  consuming  units,  and  the 
problematic  equivalence  of  income  level  with  the  quality  of 
life. 


As  another  illustration,  the  regression  analysis  of  a 
communications  satellite  launching  nay,  for  purposes  of 
research  budget  request,  properly  deal  with  the  probability 
of  successful  orbital  entry,  or  launch  failure,  rather  than 
with  orbital  apogee,  perigee,  period,  etc.  Thus  all  the 
information  concerning  launch  conditions  and  technology 
would  be  used  to  yield  a prior  probability  of  success,  a 
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result  more  tractable  for  management  and  more  Ciosely 
related  to  project  costs  than  estimates  of  orbital  physics. 


It  is  felt  that  the  general  class  of  problems  dealt 
with  here  is  important  and  previously  overlooked,  or 
misclassif ied  in  the  literature.  Several  Bernoulli  models 
are  presented  in  the  sections  that  follow.  Point  and 
interval  estimates  for  are  developed,  a hypothesis  test 

is  given  for  evaluating  the  contribution  of  parameters  to 

complicated,  realistic  models,  a stepwise  construction 
technique  is  proposed,  and  a heuristic  is  given  tor  choosing 
between  functional  forms  for  the  regression. 


y 


i, 

i- 
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C0H£A£2£Ojj  wilii  Zl&miun  analysis 


A technique  often  misapplied  to  Bernoulli  regression 
problems  is  that  of  discriminant  analysis.  Borrowing  prior 
notation  in  this  new  context,  a binary  discriminant  analysis 
provides  a decision  rule  for  classifying  an  individual  as  a 
member  of  one  of  two  populations  (tT,,U^)  from  examination  of 

a set  of  k properties,  1 ^ . Each  individual  is  asserted  to 

be  a permanent  member  of  only  one  of  the  populations.  The 

discriminant  analysis  attempts  to  determine  which  of  these 
mutually  exclusive  populations  contains  the  individual. 


Por  example,  the  Internal  Revenue  Service  in  this 
country  uses  a property  set  consisting  of  income  level, 

deduction  types  and  quantities,  etc.,  in  order  to  classify 

an  individual  filing  an  income  tax  return  either  as  a member 
of  the  population  of  chiselers,  or  honest  tax  payers.  Those 
classified  in  the  former  population  are  audited  in  detail 
for  errors  and  misrepresentations. 

Applications  occur  frequently  in  the  literature,  and 
classically  have  included  taxonomic  classification  by 
physical  measurement,  qualitative  biochemical  analysis, 
pattern  recognition,  identification  of  archeological 
remnants,  and  so  forth.  For  excellent  examples  see 
Fisher[85]  and  Nilson[176]. 


The 

members 

function 


discriminant  analysis  requires  use  of  n^  known 

of  IT#  and  n individuals  from  IT,  and  a density 
1 2 2 

for  the  property  set  of  each  population,  f (X)  and 


; 

> 


i 


! 


J 


101 


f2(*>- 


The  probability  of  an  observation  in  the  neighborhood 
of  the  point  given  that  the  individual  is  from  *7Tr  is 


f (X  ) dX  . 

1 j j 


This  probability  is  proportional  to  the  argument  f ^ (X_^) 

which  is  defined  as  the  likelihood  function  for  the  point 

X^.  The  fundamental  principle  of  discriminant  analysis  is 

to  classify  the  individual  as  a member  of  T»  or  T, 

1 2 

according  to  the  relative  size  of  f (X  ) and  f (X  ) , and  the 

1 j 2 j 

costs  of  each  type  of  misclassif ica tion . In  general,  the 

density  functions  f and  f will  contain  unknown  parameter 

1 2 

vectors  T and  T , and  these  parameters  must  be  estimated 
1 2 

from  the  known  members  of  each  population.  The  parametric 

estimation  is  usually  performed  with  N.L.E.,  as  previously 
discussed. 


The  discriminant  analysis  will  further  reguire  the 
prior  probability  of  selecting  a member  of  V for  analysis. 


Pr  = n / (n  + n ) , 
1 11  2 


or,  when  population  sizes  are  unknown,  a sampling  estimate 
of  Pr  may  be  used.  If  no  sampling  information  is 

available,  and  the  population  sizes  are  unknown,  Pr^  is 


^ A 
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assumed  to  be  0.5 
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1 11 


Also,  the  costs  of  aisclassif ication,  c and  C , 

2 1 1 1 1 2 

aust  be  stated,  where  C is  the  cost  of  aisclassifying  *V~ 

2 1 1 2 

when  the  individual  is  actually  from  IT.  Without  loss  of 

1 


generality,  c 


HI 


and 


C , 
2 | 2 


the 


costs 


of 


correct 


classification,  are  taken  to  be  zero. 


Finally,  the  decision  rule  for  discrimination  is: 
classify  X as  a aenber  of  if 

j 1 


Pr  c f (X  ) > (1  - Pr  ) c f (X  ) , 
1 2| 1 1 j 1 1 | 2 2 j 


and  classify  X as  a aeaber  of  "JT  otherwise.  This  minimizes 

j 2 

the  expected  cost  of  aisclassif ication. 


Clearly,  the  results  above  may  be  generalized  to  any 
nuaber  cf  populations.  Such  multiple  discriminant  analysis 
is  required  in  aachines  for  character  recognition  in  which  a 
hardware  autoaaton  carries  out  the  analysis  automatically  in 
a fascinating  way[  175].  Another  example  of  the  technique  is 
aultipnasic  screening  of  school  children  for  physical  and 
mental  defects  by  tests  and  inexpensive  profile 
measurements.  In  this  manner,  a single  property  set  is 
examined  in  order  to  classify  an  individual  as  healthy,  or 
medically  defective  in  any  of  several  ways.  It  is  assumed 
that  these  defects,  once  identified,  may  be  verified  with 
certainty  by  a more  thorough,  and  expensive  examination. 

In  contrast  to  the  discriminant , the  Bernoulli 
regression  model  is  not  concerned  with  classifying 
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observations  into  peraanent  populations  of  success  and 
failure,  but  rather  with  forecasting  the  probability,  p^,  of 

an  individual  achieving  a success.  The  implication  is  that 
repeated  trials  on  the  saae  individual  Mill  produce  soar* 
successes,  and  soae  failures,  and  that  the  properties  X^  are 

not  uniquely  those  of  a member  of  soate  population  of 
successes,  or  failures.  It  is  interesting  to  note  that  many 
applicaticns  of  discrininant  analysis  in  the  literature  are 
specious  for  just  this  reason. 
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To  proceed  with  Bernoulli  regression  one  lust  choose  a 
functional  form  for  p^  in  the  Bernoulli  density,  and  then 

estimate  any  unknown  paraaeters,  T,  in  this  function  using 

the  observations 


{*#!)# 


and  remembering  that  Bernoulli  regression  must  produce 
predictions  satisfying 


0 < p < 1 . 

j 


Among  the  mathematical  transformations  available  for 
our  use  are  a general  linear  model  with 


P = X T , 

3 3 


an  exponential 


anotner  exponential 


0 < X^T  < 1 ; 


= exp{-X^T}  , 
0 < X_T  < ®o  ; 
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* expi-[X^T]  1 » 
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the  logistic  function 


Pj  = [1  ♦ expi-X^T}  ] ; 


Urban's  tranformation 


-1  -1 

= 1/2  ♦ IT  tan  {X^T}  ; 


a trigonometric  model 


P = (1/2)  [1  ♦ sin  {X^T)  ] , 
-*^2  < X_T  < °ny 2 ; 


and  so  forth. 


There  are,  of  course,  an  infinite  number  of  candidates, 
as  in  any  regression  problem.  We  have  chosen  each  of  these 
to  contain  the  linear  form  X^T.  In  this  way,  there  is  a 

single  parameter  in  T associated  with  each  characteristic  in 

X.  This  permits  definition  of  x^=1  so  t^iat  t1  Bay  be 

interpreted  as  an  "intercept"  parameter  in  each  model. 

Also,  addition  and  deletion  of  characteristics  in  | may  be 
performed  easily;  this  facilitates,  for  instance, 
introduction  of  additional  variables  to  X,  to  allow  for 
nonlinear  interaction  of  characteristics.  Just  as  in  least 
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squares  regression,  it  is  perfectly  admissible,  and  useful, 
for  the  independent  variables  to  take  on  discrete  values 
(e.g.,  0,1).  Finally,  we  shall  discover  a salutary 
distributional  property  of  a wide  class  of  such  models,  and 
we  will  develop  a method  for  comparing  the  efficucy  of  two, 
or  more,  models  in  any  particular  problem. 


Note  that  several  of  the  models  given  require  a 

constraint  on  the  linear  fora  X T.  This  follows  from  the 

j 

"1-0"  constraint  on  and  the  desirability  of  providing  for 
to  be  stated  as  a single  valued  function  of  tne  argument 
X T.  Although  such  constraints  can  be  accomodated 

j 

numerically,  their  number  grows  directly  with  the  number  of 

observations.  For  ease  of  exposition,  we  choose  an 
unconstrained  model  for  complete  development  here.  That  is, 
the  transformation  used  mathematically  guarantees  a feasible 
probability,  p^. 


As  our  example,  we  will  use  M.L.  estimation  for  the 
logistic  transformation.  8erkson[20]  suggests  the  logistic 
function  for  bio-assay  models.  Also  see  the  presentation 
given  by  Finney[  84  ).  A development  of  L.S.  estimation  for  a 
similar  logistic  model  is  given  by  Mainer  and  Djncan[22b], 
He  rememDer,  thouyh,  that  the  L.S.  assumptions  do  not  lead 
to  tractacle  distributional  results,  while  the  K.L.E. 
approach  will  yield  excellent  large  sample  properties  with 
invariance. 

The  log  likelihood  for  the  parametric  Bernoulli  family 


L (P)  = 


= y .imp  ) ♦ (i  - y ,)  in  (i  - p ) ] 
3*1  3 j 3 j 
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since 


In  f(P^)/  * y^/P^  - (1  - y^)/0  - P^)  # 


we  see  that  regardless  of  the  parametric  fora  for  d^, 
Ei  L • ( P)  1=0  fcy  inspection.  Parameterization  with  the 
logistic  function 


gives 


P^  * [ 1 ♦ exp{-X^T}  ] , 


ln(p^)  = -In  (1  * exp  {-X^T} ) , 


In  (1  - py  = -X^T  - ln(1  ♦ exp(-X^T)) 


From  this  we  find  the  gradient 


V = f |[y  X4.exp(-XT}  - (1  - y )X  ] / [1  ♦ exp(-XT)])  , 

i pi  j ji  1 3 ji  3 


and  symmetric  Hessian  matrix 


fl(T)  = {*»  } » t (-X  x eip(-I  I)  / [1  ♦ exp{-X  T}]  ) 

lk  ji  jk  j j 


The  symmetric  information  matrix,  E[-g(T)  ],  is 
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E[ -h  ] * jfc  (x  x exp{-X  T)  / [1  ♦ exp{-X  T}]2) 
lk  ji  jk  1 j 


(Feaarkafcly,  froa  this  we  see  that  the  Newton-ha phson  and 
Pisher's  scoring  aethods  are  identical  for  this  aodel.) 


Once  an  N.L.E.  solution,  T,  has  ueen  deterained  for  a 
particular  problea,  the  invariance  property  gives  for  tne 
single  valued  function  p 


j 


A A 

Pj  (T)  = P j (T)  , 


and  the  Naxiaua  Likelihood  Estiaation  ot  the  Bernoulli 
regression  is  coaplete. 


1 
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We  can  derive  confidence  limits  for  the  par ameter  p by 

A j 

noting  that  the  tl.L.  estimators,  T,  arc  asymptotically 
normally  distributed  with  a variance  covariance  matrix  giv°n 
by  th  • Cramer-Hao  bound,  the  inverse  of  the  information 
matrix. 


A A -1 

V(T)  - E[-fl(T)  ] 

Using  the  logistic  model  as  an  example,  clearly  the  scalar, 

A 

v- 

is  a linear  combination  of  asymptotically  normally 
distributed  random  variaoles,  and  thus  is  asymptotically 
normally  distributed  with  parameters 


m 

10) 


X T 

1 


and 


m 

2 0) 


X ? X'  . 

r i 


Confidence  limits  for  X T are 

j 


B) 

1(j) 


1/2 

i z m , 
a/2  2 ( j) 


where  z is  the  appropriate  unit  normal  variate  for 

a/2 


no 


< .. ' — aa 


Aim -■ 


** ■ 


confidence  level  1-a.  Confidence  limits  for  p follow 

directly  by  application  of  the  logistic  function  to  the 

liaits  for  X T. 

j 


All  the  mathematical  transf  orma  tions,  {^(X  ,T), 

proposed  earlier  for  possible  use  in  bernoulli  regression 

were  chosen  to  be  single  valued  functions  of  the  scalar 
argument  X^T  to  provide  similar  results  in  general. 


To  test  the  contribution  of  a particular  parameter,  or 
? 

set  of  parameters,  T,  in  the  prediction  of  p^,  a likelihood 

ratio  hypothesis  test  developed  by  Neynan  and  Pearson[ 173  ], 

Wald[  224  ] and  given  in  flood  and  Graybilif  1 b9  ],  can  be  used. 

a 

As  an  example,  to  compare  p^(T)  and  (T) , with 


♦ ? 

T = (T,TJ  , 


A 

* 


obtain  T and  T,  and  compute  the  statistic 

A * 

-2  ln[L(T)  / L (T)  ] . 

As  n approaches  infinity,  the  statistic  is  asymptotically 
chi-square  with  degrees  of  freedom  equal  to  the  number  of 
? 

elements  in  T,  providing  sufficient  statistical  grounds  to 
give  a constructive  hypothesis  test. 


A stepwise  Bernoulli  regression  algorithm  can  be 
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defined  and  performed  by  introducing  each  of  the  variables 
singly  with  an  intercept  term,  determining  the  rt.L.E.  in 
each  case , and  keeping  the  characteristic  leading  to  tne 
greatest  likelihood  among  the  competing  two-variable  models. 
Sequential  steps  proceed,  selecting  one  additional  variable 
at  a time  by  the  criterion  of  maximum  likelihood 
contribution  amcng  all  remaining  individual  candidates.  The 
procedure  stops  either  when  all  variables  have  been 
included,  or  when  addition  of  another  variable  produces  a 
likei.inooa  ratio  less  than  the  value  of  the  chi-square 
integral  for  one  degree  of  freedom  and,  say,  95  percent 
confidence  (3 . 84 1 • ••) . 


To  choose  between  alternative  functional  forms  for  the 
regression  model,  for  instance  between  the  logistic  model 
and  the  Urban  transformation,  the  following  heuristic  is 
proposed:  perform  the  stepwise  algorithm  for  both 
mathematical  functions,  and  select  the  function  for  which 
the  final  likelihood  is  larger. 


g.  iN  £fiI2i£lIQN  OF  JjABOR  FORCE  PA|liCI£ATiON 


An  interesting  problem  to  which  Bernoulli  regression  is 
applicable  is  given  by  Solberg[ 214  ],  who  investigates  the 
propensity  of  feaale  heads-of-household  with  dependent 
children  to  join  the  lanor  force,  and  gives  extensive 
references  to  and  criticisms  of  published  analyses  using 
other  statistical  techniques. 

The  data  used  for  analysis  is  extracted  as  a subset  of 
the  March,  1970,  Person-Family  file  of  the  Current 

Population  Survey  conducted  by  the  United  States  Census 
Bureau;  the  data  selection  criteria  produces  2,819 

observations  of  female  family  heads  with  dependent  children 
present  who  are  in  the  civilian  non-institut ionalized 

population  with  a primary  source  of  income  not  gained  from 
self-employment  in  agriculture. 

In  order  to  study  only  individuals  who  can  reasonably 
be  expected  to  participate  in  the  labor  force  with  non-zero 
probability,  family  heads  over  70  years  of  age  are  deleted 
along  with  certain  identifiable  incomplete  entries,  leaving 
2,222  observations. 

Observation  j is  coded  y =1  if  the  head-of-househcld  is 

j 

working,  with  a job  but  not  working,  or  looking  for 
employment,  and  y^  = 0 for  those  heads  who  are  at  home,  in 

school,  unable  to  work,  or  have  other  reasons  for  not 

participating. 

Sixteen  independent  variables  are  defined  for  each 
observation  as  follows 
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x^  = 1,  an  intercept  term, 

x = earned  income  in  hundreds  of  dollars, 

2 

x = welfare  income  in  hundreds, 

3 

x^  = unemployment  compensation, 

x = other  income, 

5 

x ,x  ,x  = "0-1"  regional  codes  for  North-Central, 

6 7 8 

South-Central  and  Western  States 
respectively , 

x = "O-l"  SMSA  Labor  flarket  Control  Variable, 

9 

equal  to  one  for  families  located  in  SHSA 


10 


1 1 


12 


13 


14 


16 


Central  Cities, 

= number  of  children  present, 

= number  of  children  present  under  six  years 
of  age, 

= number  of  other  adults  present, 

= "0-1”  race  code,  equal  to  one  for  black 
head-of- household, 

= age  of  head-of-household, 

= age  squared, 

= years  of  education. 


A program  was  written  in  FORTRAN  to  access  the 
observations  on  a mass  storage  device,  providing  features  to 
select  any  desired  subset  of  the  observations,  scale,  or 

normalize  the  variables,  selectively  list  observations,  and 

A 

obtain  the  M.L.E.,  T,  for  either  the  logistic  or  Urban's 


transformation  for  p (T) , using  a second  order 

j 

representation  of  the  problem  and  numerically  bounded 
variables.  A high  resolution  ti;.er  provides  active  compute 


1 14 
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tine  statistics  for  the  host  coaputer,  an  IBM  360/67  - II 

operated  under  the  HVT  system.  The  object  program  was 

generated  by  the  FORTRAN-IV  (H)  compiler  with  code 

optimization,  requiring  a memory  region  ot  approximately 
200K  bytes. 

Other  program  features  include  an  automatic  stepwise 
introduction  of  variables  to  a given  minimal  fixed  model 
from  remaining  indicated  candidates,  with  sequential 
selection  made  on  the  basis  of  maximum  log  likelihood 
contribution,  and  termination  triggered  by  a likelihood 
ratio  hypothesis  test  successively  performed  at  each  step 
with  a level  of  significance  specified  by  the  user. 

Also,  a variance  covariance  matrix  is  given  for  any 

A 

designated  model  solution,  T,  by  use  of  the  inverse  Hessian 
Cramer-Rao  bound,  and  used  to  compute  confidence  intervals 
for  p^  (T)  for  specified  observations  in  the  original  data 

set,  or  other  source.  The  final  regression  model  is  applied 

to  the  data  and  a frequency  distribution  is  individually 
produced  for  observations  with  y =0  and  kith  y =1. 

j j 

In  our  analysis,  the  logistic  and  Urban's 
transformations  were  separately  applied  both  stepwise  and 
simultaneously  to  all  sixteen  variables,  a ten  variable 
subset  consisting  of 

{T,X.,  i=1, 2, 3, 4, 5, 9,11,  13,14,16}  , 

l 

and  an  eight  variable  subset  comprised  of 

(X,X^,  i=  1 , 2 , 3 , 5 , 1 1 , 1 3 , 1 4 , 1 6}  . 
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On  the  basis  of  the  log  likelihood  heuristic  proposed 
earlier,  and  on  the  apparent  reluctance  with  which  Urban’s 
transf orma ticn  producer  predicted  probabilities  near  zero, 
the  logistic  model  was  selected  for  further  detailed 
analyses . 


As  an  example  of  the  results,  the  eight  variable 
composite  gave  a stepwise  logistic  model  with  variables 
introduced  in  the  sequence  indicated  with  each  step: 


ilEE 

1 

2 

3 

4 

5* 

6 

7 


1 3 

2 

Ik 

5 

22 

24 

1.111  -.042 

. 162  -.016 

.300 

-.025  -.016 

.298 

.011 

.206  -.019 

.292 

.011 

-.507 

.399  -.018 

.289 

.013 

-.643 

-.539 

.945  -.019 

.287 

.012 

-.571 

-.690 

-.012 

.530  -.020 

.286 

.013 

-.545 

-.698 

-.012 

The  final  log  likelihood  for  this  model  is  -795.2.  The 
asterisk  indicates  the  six  variable  model  tor  which  a 95 
percent  likelihood  ratio  test,  with  critical  chi  square 
value  3.841,  would  terminate  with  log  likelihood  -797.9. 


Execution  time  for  this  run  includes  disk  access,  M.L. 
estimation,  comparison  and  output  of  seven  two-variable 
models,  each  with  2,222  observations,  six  three-variable 
models,  five  four- variable  models,  and  so  forth,  yielding  an 
aggregate  to  10  minutes,  14  seconds. 


For  specific  subproblems,  an  individual  M.L.  estimation 
is  not  permitted  by  the  program  to  require  more  than  ten 
iterations.  This  bound  was  never  exercised  by  the  Bernoulli 
models  discussed  here.  The  eight  variable  model  required  an 
average  at  all  levels  of  search  dimensionality  cf  4.1 
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It  is  important  to  note  the  remarkable  step  to  step 

A 

stability  of  individual  terns  in  T , with  the  exception  of 
the  intercept  tern.  This  clearly  shows  that  use  of  the 
previous  solution  as  a starting  value  for  successive 

1 iterations  can  greatly  accelerate  convergence  of  the  second 

order  representation  of  the  problen.  Exploitation  of  such 
behavior  in  nonlinear  estination  has  been  suggested  by 
Ross[ 207  ]. 

The  regression  predictions  for  p givea  for  the  2,222 
observations  by  the  final  logistic  nodel  are  given  in  the 
following  frequency  distribution 


FORECAST 

ACTUAL 

£ 

lz  C 

if! 

0-.  1 

139 

5 

. 1- . 2 

308 

37 

.2-.  3 

232 

54 

.3-.  4 

134 

60 

.4-. 5 

35 

65 

.5-. 6 

24 

61 

.6-.  7 

9 

51 

.7-.  8 

11 

62 

.8-.  9 

12 
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34 

772 

In  another  experiment,  a subset  of  400  observations  was 

A 

randosly  selected  and  T determined  without  stepwise 
introduction  of  variables  for  the  sixteen  variable  logistic 

model.  The  solution  of  this  pilot  model  was  then  used  as  a 

A 

startinq  value  for  computation  of  T for  all  2,222 
observations,  with  a total  computation  time  of  3 minutes,  21 
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seconds.  A direct  estimation  without  this  preliminary  step 
required  6 minutes,  8 seconds.  Constructive  stepwise 
estimation  or  all  sixteen  variables  with  no  pilot  models 
required  80  minutes,  42  seconds. 

Specification  of  the  appropriate  size  of  such  a pilot 
run  is  difficult,  since  a subset  too  small  will  qive  a 
solution  of  doubtful  value  (numerically  and  statistically) 
for  startinq  the  larger  model,  and  a subset  too  largo 
defeats  the  purpose  of  the  approach.  It  is  such  small 
sample  cases  that  exercise  the  numerical  bounds  and  other 
provisions  for  difficulties  in  estimation.  As  a rule  oi 
thumb,  400  Bernoulli  observations  are  used  here  with  good 
success  for  "all  at  once"  models. 

Experience  with  all  these  models  indicates  that  the 
constructive  stepwise  approach,  possibly  begun  with  a 
minimum  model  based  on  the  investigator's  prior  experience, 
and  terminated  by  the  likelihood  ratio  test,  is  a generally 
reasonable  plan  of  attack.  Although  computation  time  can  be 
very  high  with  such  a method,  important  benefits  are  derived 
from  model  analysis  by  the  M.L.  estimation  of  subset  models 
in  the  course  of  solution.  For  instance,  subtle 
comcomitance  among  variables  may  be  detected  by  analysis  of 
intermediate  output  that  would  evade  detection  in  a final 
variance  covariance  matrix. 

Other  Bernoulli  regression  models  have  been  studied, 
includinq  prediction  of  the  probability  of  winning  a 
horserace,  based  on  handicap  data,  and  the  estimation  of  the 
probability  of  increase  in  stock  price  from  market  analysis 
and  financial  information  in  investment  survey  guides.  Most 
recently,  DeMont  and  White[70]  report  analysis  of  tactical 
data  from  tank  engagements  in  the  Arab-Israel  conflicts. 

Generalization  of  these  techniques  is  possible  to 
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accoaoda  te  binoaial,  or  Bultinonial  data  and  aodels  with 
dependent  observations,  although  such  work  regains  to  be 
done. 
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